2025-05-07T20:23:13.3944892Z Current runner version: '2.323.0'
2025-05-07T20:23:13.3950413Z Runner name: 'i-0b68a33264ad7b273'
2025-05-07T20:23:13.3951395Z Machine name: 'ip-10-0-14-174'
2025-05-07T20:23:13.3954176Z ##[group]GITHUB_TOKEN Permissions
2025-05-07T20:23:13.3956484Z Contents: read
2025-05-07T20:23:13.3957026Z Metadata: read
2025-05-07T20:23:13.3957534Z Packages: read
2025-05-07T20:23:13.3958042Z ##[endgroup]
2025-05-07T20:23:13.3959988Z Secret source: None
2025-05-07T20:23:13.3960643Z Prepare workflow directory
2025-05-07T20:23:13.4884742Z Prepare all required actions
2025-05-07T20:23:13.4924176Z Getting action download info
2025-05-07T20:23:13.7038331Z Download action repository 'actions/checkout@v4' (SHA:11bd71901bbe5b1630ceea73d27597364c9af683)
2025-05-07T20:23:13.9891981Z Download action repository 'actions/download-artifact@v4' (SHA:d3f86a106a0bac45b974a628896c90dbdf5c8093)
2025-05-07T20:23:14.3999816Z Download action repository 'pytorch/test-infra@main' (SHA:117fccdf5892ff9a958d2afb4b4b8b6e930d3187)
2025-05-07T20:23:16.0056498Z Getting action download info
2025-05-07T20:23:16.1091791Z Download action repository 'nick-fields/retry@3e91a01664abd3c5cd539100d10d33b9c5b68482' (SHA:3e91a01664abd3c5cd539100d10d33b9c5b68482)
2025-05-07T20:23:16.3112390Z Complete job name: test_and_publish_artifact (x86, linux.g5.4xlarge.nvidia.gpu, genai, 3.13, 12.6.3, 12.6.3, gcc)
2025-05-07T20:23:16.3728880Z A job started hook has been configured by the self-hosted runner administrator
2025-05-07T20:23:16.3862526Z ##[group]Run '/home/ec2-user/runner-scripts/before_job.sh'
2025-05-07T20:23:16.3875121Z shell: /usr/bin/bash --noprofile --norc -e -o pipefail {0}
2025-05-07T20:23:16.3876921Z ##[endgroup]
2025-05-07T20:23:17.4974099Z Runner Type: linux.g5.4xlarge.nvidia.gpu
2025-05-07T20:23:17.4974539Z Instance Type: g5.4xlarge
2025-05-07T20:23:17.4974784Z AMI Name: unknown
2025-05-07T20:23:17.5012706Z AMI ID: ami-071226ecf16aa7d96
2025-05-07T20:23:22.9017110Z ##[group]Run actions/checkout@v4
2025-05-07T20:23:22.9017416Z with:
2025-05-07T20:23:22.9017662Z   submodules: true
2025-05-07T20:23:22.9017914Z   repository: pytorch/FBGEMM
2025-05-07T20:23:22.9018298Z   token: ***
2025-05-07T20:23:22.9018505Z   ssh-strict: true
2025-05-07T20:23:22.9018712Z   ssh-user: git
2025-05-07T20:23:22.9018937Z   persist-credentials: true
2025-05-07T20:23:22.9019180Z   clean: true
2025-05-07T20:23:22.9019410Z   sparse-checkout-cone-mode: true
2025-05-07T20:23:22.9019676Z   fetch-depth: 1
2025-05-07T20:23:22.9019891Z   fetch-tags: false
2025-05-07T20:23:22.9020110Z   show-progress: true
2025-05-07T20:23:22.9020325Z   lfs: false
2025-05-07T20:23:22.9020538Z   set-safe-directory: true
2025-05-07T20:23:22.9020786Z env:
2025-05-07T20:23:22.9021001Z   PRELUDE: .github/scripts/setup_env.bash
2025-05-07T20:23:22.9021295Z   BUILD_ENV: build_binary
2025-05-07T20:23:22.9021553Z   BUILD_TARGET: genai
2025-05-07T20:23:22.9021770Z   BUILD_VARIANT: cuda
2025-05-07T20:23:22.9022027Z   BUILD_CUDA_VERSION: 12.6.3
2025-05-07T20:23:22.9022275Z   ENFORCE_CUDA_DEVICE: 1
2025-05-07T20:23:22.9022508Z ##[endgroup]
2025-05-07T20:23:23.0189936Z Syncing repository: pytorch/FBGEMM
2025-05-07T20:23:23.0191253Z ##[group]Getting Git version info
2025-05-07T20:23:23.0191824Z Working directory is '/home/ec2-user/actions-runner/_work/FBGEMM/FBGEMM'
2025-05-07T20:23:23.0192606Z [command]/usr/bin/git version
2025-05-07T20:23:23.0192943Z git version 2.47.1
2025-05-07T20:23:23.0208952Z ##[endgroup]
2025-05-07T20:23:23.0219362Z Copying '/home/ec2-user/.gitconfig' to '/home/ec2-user/actions-runner/_work/_temp/d4e0d646-646d-4635-b7a6-4aac06c6045d/.gitconfig'
2025-05-07T20:23:23.0229496Z Temporarily overriding HOME='/home/ec2-user/actions-runner/_work/_temp/d4e0d646-646d-4635-b7a6-4aac06c6045d' before making global git config changes
2025-05-07T20:23:23.0230472Z Adding repository directory to the temporary git global config as a safe directory
2025-05-07T20:23:23.0243041Z [command]/usr/bin/git config --global --add safe.directory /home/ec2-user/actions-runner/_work/FBGEMM/FBGEMM
2025-05-07T20:23:23.0289710Z [command]/usr/bin/git config --local --get remote.origin.url
2025-05-07T20:23:23.0315300Z https://github.com/pytorch/FBGEMM
2025-05-07T20:23:23.0333825Z ##[group]Removing previously created refs, to avoid conflicts
2025-05-07T20:23:23.0337744Z [command]/usr/bin/git rev-parse --symbolic-full-name --verify --quiet HEAD
2025-05-07T20:23:23.0363689Z refs/heads/main
2025-05-07T20:23:23.0374173Z [command]/usr/bin/git checkout --detach
2025-05-07T20:23:23.9071531Z HEAD is now at b6b2ce3 Migrate TBE forward kernels to `FBGEMM_LAUNCH_KERNEL` (#4079)
2025-05-07T20:23:23.9123905Z [command]/usr/bin/git branch --delete --force main
2025-05-07T20:23:23.9155296Z Deleted branch main (was b6b2ce3).
2025-05-07T20:23:23.9160483Z ##[endgroup]
2025-05-07T20:23:23.9163996Z [command]/usr/bin/git submodule status
2025-05-07T20:23:23.9586910Z  e5d7c0bd5d9aec44d68830187138149e6a8c4e32 external/asmjit (e5d7c0b)
2025-05-07T20:23:23.9671615Z  4a61bdd4bd4ed730e078aebc7c0fcf046ff29406 external/composable_kernel (4a61bdd)
2025-05-07T20:23:23.9759703Z  6543fec09b2f04ac4a666882998b534afc9c1349 external/cpuinfo (6543fec)
2025-05-07T20:23:23.9848541Z  3ed8d2ec4ba35ef5d9d8353826209b6f868f63d3 external/cutlass (3ed8d2e)
2025-05-07T20:23:23.9934064Z  f8d7d77c06936315286eb55f8de22cd23c188571 external/googletest (f8d7d77)
2025-05-07T20:23:24.0019285Z  420084499c7c1e1c2d801922f40df202eac5f3a0 external/hipify_torch (4200844)
2025-05-07T20:23:24.0102492Z  9cca280a4d0ccf0c08f47a99aa71d1b0e52f8d03 external/json (9cca280)
2025-05-07T20:23:24.0116999Z ##[group]Cleaning the repository
2025-05-07T20:23:24.0121932Z [command]/usr/bin/git clean -ffdx
2025-05-07T20:23:24.0180398Z [command]/usr/bin/git reset --hard HEAD
2025-05-07T20:23:24.0294255Z HEAD is now at b6b2ce3 Migrate TBE forward kernels to `FBGEMM_LAUNCH_KERNEL` (#4079)
2025-05-07T20:23:24.0302071Z ##[endgroup]
2025-05-07T20:23:24.0304281Z ##[group]Disabling automatic garbage collection
2025-05-07T20:23:24.0308683Z [command]/usr/bin/git config --local gc.auto 0
2025-05-07T20:23:24.0341684Z ##[endgroup]
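The cleanup above follows checkout's usual pattern: detach HEAD first, then force-delete the stale local branch so the upcoming fetch cannot collide with it, and pin gc.auto to 0 so git never garbage-collects mid-job. A minimal sketch of the same sequence, assuming a repository that still has a local main branch:

    # Detach so the branch can be deleted even if it is currently checked out
    git checkout --detach
    # Remove the stale ref; the next fetch will recreate whatever ref it needs
    git branch --delete --force main
    # Keep git from running garbage collection during the job
    git config --local gc.auto 0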
2025-05-07T20:23:24.0342142Z ##[group]Setting up auth
2025-05-07T20:23:24.0347822Z [command]/usr/bin/git config --local --name-only --get-regexp core\.sshCommand
2025-05-07T20:23:24.0390668Z [command]/usr/bin/git submodule foreach --recursive sh -c "git config --local --name-only --get-regexp 'core\.sshCommand' && git config --local --unset-all 'core.sshCommand' || :"
2025-05-07T20:23:24.0724863Z Entering 'external/asmjit'
2025-05-07T20:23:24.0791039Z Entering 'external/composable_kernel'
2025-05-07T20:23:24.0863547Z Entering 'external/cpuinfo'
2025-05-07T20:23:24.0932115Z Entering 'external/cutlass'
2025-05-07T20:23:24.1005841Z Entering 'external/googletest'
2025-05-07T20:23:24.1071075Z Entering 'external/hipify_torch'
2025-05-07T20:23:24.1138490Z Entering 'external/json'
2025-05-07T20:23:24.1224844Z [command]/usr/bin/git config --local --name-only --get-regexp http\.https\:\/\/github\.com\/\.extraheader
2025-05-07T20:23:24.1258953Z [command]/usr/bin/git submodule foreach --recursive sh -c "git config --local --name-only --get-regexp 'http\.https\:\/\/github\.com\/\.extraheader' && git config --local --unset-all 'http.https://github.com/.extraheader' || :"
2025-05-07T20:23:24.1591484Z Entering 'external/asmjit'
2025-05-07T20:23:24.1658187Z Entering 'external/composable_kernel'
2025-05-07T20:23:24.1730866Z Entering 'external/cpuinfo'
2025-05-07T20:23:24.1797551Z Entering 'external/cutlass'
2025-05-07T20:23:24.1874404Z Entering 'external/googletest'
2025-05-07T20:23:24.1941202Z Entering 'external/hipify_torch'
2025-05-07T20:23:24.2006939Z Entering 'external/json'
2025-05-07T20:23:24.2093185Z [command]/usr/bin/git config --local http.https://github.com/.extraheader AUTHORIZATION: basic ***
2025-05-07T20:23:24.2147773Z ##[endgroup]
2025-05-07T20:23:24.2148785Z ##[group]Fetching the repository
2025-05-07T20:23:24.2156288Z [command]/usr/bin/git -c protocol.version=2 fetch --no-tags --prune --no-recurse-submodules --depth=1 origin +a2f4c52051596e74bc8c16e3d2867a4ecdd271e0:refs/remotes/pull/4066/merge
2025-05-07T20:23:24.4546439Z From https://github.com/pytorch/FBGEMM
2025-05-07T20:23:24.4547239Z  * [new ref]         a2f4c52051596e74bc8c16e3d2867a4ecdd271e0 -> pull/4066/merge
2025-05-07T20:23:24.4573863Z ##[endgroup]
2025-05-07T20:23:24.4574628Z ##[group]Determining the checkout info
2025-05-07T20:23:24.4576694Z ##[endgroup]
2025-05-07T20:23:24.4581700Z [command]/usr/bin/git sparse-checkout disable
2025-05-07T20:23:24.4634927Z [command]/usr/bin/git config --local --unset-all extensions.worktreeConfig
2025-05-07T20:23:24.4665421Z ##[group]Checking out the ref
2025-05-07T20:23:24.4669659Z [command]/usr/bin/git checkout --progress --force refs/remotes/pull/4066/merge
2025-05-07T20:23:24.4799448Z Previous HEAD position was b6b2ce3 Migrate TBE forward kernels to `FBGEMM_LAUNCH_KERNEL` (#4079)
2025-05-07T20:23:24.4802513Z HEAD is now at a2f4c52 Merge 6060cd4b5f971680caecdcc657faccb5720d1c3e into fd4df5f456e0cca514bacd98a39efb72990fd9f4
2025-05-07T20:23:24.4812624Z ##[endgroup]
2025-05-07T20:23:24.4813151Z ##[group]Setting up auth for fetching submodules
2025-05-07T20:23:24.4818334Z [command]/usr/bin/git config --global http.https://github.com/.extraheader AUTHORIZATION: basic ***
2025-05-07T20:23:24.4868699Z [command]/usr/bin/git config --global --unset-all url.https://github.com/.insteadOf
2025-05-07T20:23:24.4899477Z [command]/usr/bin/git config --global --add url.https://github.com/.insteadOf git@github.com:
2025-05-07T20:23:24.4930425Z [command]/usr/bin/git config --global --add url.https://github.com/.insteadOf org-21003710@github.com:
2025-05-07T20:23:24.4959150Z ##[endgroup]
2025-05-07T20:23:24.4959782Z ##[group]Fetching submodules
2025-05-07T20:23:24.4962566Z [command]/usr/bin/git submodule sync
2025-05-07T20:23:24.5337647Z Synchronizing submodule url for 'external/asmjit'
2025-05-07T20:23:24.5338775Z Synchronizing submodule url for 'external/composable_kernel'
2025-05-07T20:23:24.5340132Z Synchronizing submodule url for 'external/cpuinfo'
2025-05-07T20:23:24.5340907Z Synchronizing submodule url for 'external/cutlass'
2025-05-07T20:23:24.5341662Z Synchronizing submodule url for 'external/googletest'
2025-05-07T20:23:24.5342189Z Synchronizing submodule url for 'external/hipify_torch'
2025-05-07T20:23:24.5342643Z Synchronizing submodule url for 'external/json'
2025-05-07T20:23:24.5354337Z [command]/usr/bin/git -c protocol.version=2 submodule update --init --force --depth=1
2025-05-07T20:23:24.5779383Z Submodule path 'external/asmjit': checked out 'e5d7c0bd5d9aec44d68830187138149e6a8c4e32'
2025-05-07T20:23:24.5927379Z Submodule path 'external/composable_kernel': checked out '4a61bdd4bd4ed730e078aebc7c0fcf046ff29406'
2025-05-07T20:23:24.6028022Z Submodule path 'external/cpuinfo': checked out '6543fec09b2f04ac4a666882998b534afc9c1349'
2025-05-07T20:23:24.6195117Z Submodule path 'external/cutlass': checked out '3ed8d2ec4ba35ef5d9d8353826209b6f868f63d3'
2025-05-07T20:23:24.6283584Z Submodule path 'external/googletest': checked out 'f8d7d77c06936315286eb55f8de22cd23c188571'
2025-05-07T20:23:24.6365231Z Submodule path 'external/hipify_torch': checked out '420084499c7c1e1c2d801922f40df202eac5f3a0'
2025-05-07T20:23:24.6469864Z Submodule path 'external/json': checked out '9cca280a4d0ccf0c08f47a99aa71d1b0e52f8d03'
2025-05-07T20:23:24.6488253Z [command]/usr/bin/git submodule foreach git config --local gc.auto 0
2025-05-07T20:23:24.6825452Z Entering 'external/asmjit'
2025-05-07T20:23:24.6858113Z Entering 'external/composable_kernel'
2025-05-07T20:23:24.6890122Z Entering 'external/cpuinfo'
2025-05-07T20:23:24.6923236Z Entering 'external/cutlass'
2025-05-07T20:23:24.6955471Z Entering 'external/googletest'
2025-05-07T20:23:24.6987642Z Entering 'external/hipify_torch'
2025-05-07T20:23:24.7020254Z Entering 'external/json'
2025-05-07T20:23:24.7065070Z ##[endgroup]
2025-05-07T20:23:24.7065579Z ##[group]Persisting credentials for submodules
2025-05-07T20:23:24.7070729Z [command]/usr/bin/git submodule foreach --recursive sh -c "git config --local --name-only --get-regexp 'url\.https\:\/\/github\.com\/\.insteadOf' && git config --local --unset-all 'url.https://github.com/.insteadOf' || :"
2025-05-07T20:23:24.7401871Z Entering 'external/asmjit'
2025-05-07T20:23:24.7444480Z url.https://github.com/.insteadof
2025-05-07T20:23:24.7444977Z url.https://github.com/.insteadof
2025-05-07T20:23:24.7487394Z Entering 'external/composable_kernel'
2025-05-07T20:23:24.7531398Z url.https://github.com/.insteadof
2025-05-07T20:23:24.7556115Z url.https://github.com/.insteadof
2025-05-07T20:23:24.7583375Z Entering 'external/cpuinfo'
2025-05-07T20:23:24.7627749Z url.https://github.com/.insteadof
2025-05-07T20:23:24.7628074Z url.https://github.com/.insteadof
2025-05-07T20:23:24.7671070Z Entering 'external/cutlass'
2025-05-07T20:23:24.7712907Z url.https://github.com/.insteadof
2025-05-07T20:23:24.7713220Z url.https://github.com/.insteadof
2025-05-07T20:23:24.7764731Z Entering 'external/googletest'
2025-05-07T20:23:24.7808864Z url.https://github.com/.insteadof
2025-05-07T20:23:24.7809202Z url.https://github.com/.insteadof
2025-05-07T20:23:24.7851695Z Entering 'external/hipify_torch'
2025-05-07T20:23:24.7895329Z url.https://github.com/.insteadof
2025-05-07T20:23:24.7895658Z url.https://github.com/.insteadof
2025-05-07T20:23:24.7937788Z Entering 'external/json'
2025-05-07T20:23:24.7979934Z url.https://github.com/.insteadof
2025-05-07T20:23:24.7980264Z url.https://github.com/.insteadof
2025-05-07T20:23:24.8043397Z [command]/usr/bin/git submodule foreach sh -c "git config --local 'http.https://github.com/.extraheader' 'AUTHORIZATION: basic ***' && git config --local --show-origin --name-only --get-regexp remote.origin.url"
2025-05-07T20:23:24.8377929Z Entering 'external/asmjit'
2025-05-07T20:23:24.8441440Z file:/home/ec2-user/actions-runner/_work/FBGEMM/FBGEMM/.git/modules/external/asmjit/config remote.origin.url
2025-05-07T20:23:24.8444207Z Entering 'external/composable_kernel'
2025-05-07T20:23:24.8507835Z file:/home/ec2-user/actions-runner/_work/FBGEMM/FBGEMM/.git/modules/external/composable_kernel/config remote.origin.url
2025-05-07T20:23:24.8509080Z Entering 'external/cpuinfo'
2025-05-07T20:23:24.8570749Z file:/home/ec2-user/actions-runner/_work/FBGEMM/FBGEMM/.git/modules/external/cpuinfo/config remote.origin.url
2025-05-07T20:23:24.8574223Z Entering 'external/cutlass'
2025-05-07T20:23:24.8635399Z file:/home/ec2-user/actions-runner/_work/FBGEMM/FBGEMM/.git/modules/external/cutlass/config remote.origin.url
2025-05-07T20:23:24.8638751Z Entering 'external/googletest'
2025-05-07T20:23:24.8700647Z file:/home/ec2-user/actions-runner/_work/FBGEMM/FBGEMM/.git/modules/external/googletest/config remote.origin.url
2025-05-07T20:23:24.8703937Z Entering 'external/hipify_torch'
2025-05-07T20:23:24.8765927Z file:/home/ec2-user/actions-runner/_work/FBGEMM/FBGEMM/.git/modules/external/hipify_torch/config remote.origin.url
2025-05-07T20:23:24.8768676Z Entering 'external/json'
2025-05-07T20:23:24.8829784Z file:/home/ec2-user/actions-runner/_work/FBGEMM/FBGEMM/.git/modules/external/json/config remote.origin.url
2025-05-07T20:23:24.8953492Z [command]/usr/bin/git submodule foreach git config --local --add 'url.https://github.com/.insteadOf' 'git@github.com:'
2025-05-07T20:23:24.9289191Z Entering 'external/asmjit'
2025-05-07T20:23:24.9322002Z Entering 'external/composable_kernel'
2025-05-07T20:23:24.9354726Z Entering 'external/cpuinfo'
2025-05-07T20:23:24.9385821Z Entering 'external/cutlass'
2025-05-07T20:23:24.9419599Z Entering 'external/googletest'
2025-05-07T20:23:24.9450625Z Entering 'external/hipify_torch'
2025-05-07T20:23:24.9482496Z Entering 'external/json'
2025-05-07T20:23:24.9530375Z [command]/usr/bin/git submodule foreach git config --local --add 'url.https://github.com/.insteadOf' 'org-21003710@github.com:'
2025-05-07T20:23:24.9860722Z Entering 'external/asmjit'
2025-05-07T20:23:24.9893491Z Entering 'external/composable_kernel'
2025-05-07T20:23:24.9925914Z Entering 'external/cpuinfo'
2025-05-07T20:23:24.9956862Z Entering 'external/cutlass'
2025-05-07T20:23:24.9987581Z Entering 'external/googletest'
2025-05-07T20:23:25.0019079Z Entering 'external/hipify_torch'
2025-05-07T20:23:25.0052450Z Entering 'external/json'
2025-05-07T20:23:25.0096003Z ##[endgroup]
2025-05-07T20:23:25.0138800Z [command]/usr/bin/git log -1 --format=%H
2025-05-07T20:23:25.0163050Z a2f4c52051596e74bc8c16e3d2867a4ecdd271e0
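The checkout step ends here with the PR merge commit a2f4c52051596e74bc8c16e3d2867a4ecdd271e0 checked out and all seven submodules pinned. The same state can be reproduced outside the runner with plain git; a sketch, assuming a clean clone and skipping the temporary-HOME and token wiring that the action handles internally:

    git clone --no-checkout https://github.com/pytorch/FBGEMM
    cd FBGEMM
    # Shallow-fetch the PR merge ref, exactly as the action does
    git -c protocol.version=2 fetch --no-tags --prune --no-recurse-submodules --depth=1 \
        origin +a2f4c52051596e74bc8c16e3d2867a4ecdd271e0:refs/remotes/pull/4066/merge
    git checkout --progress --force refs/remotes/pull/4066/merge
    # Pin the submodules to their recorded commits
    git -c protocol.version=2 submodule update --init --force --depth=1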
2025-05-07T20:23:25.0353882Z ##[group]Run actions/download-artifact@v4
2025-05-07T20:23:25.0354190Z with:
2025-05-07T20:23:25.0354423Z   name: fbgemm_genai_x86_gcc_py3.13_cu12.6.3.whl
2025-05-07T20:23:25.0354735Z   merge-multiple: false
2025-05-07T20:23:25.0354978Z   repository: pytorch/FBGEMM
2025-05-07T20:23:25.0355225Z   run-id: 14891846252
2025-05-07T20:23:25.0355426Z env:
2025-05-07T20:23:25.0355642Z   PRELUDE: .github/scripts/setup_env.bash
2025-05-07T20:23:25.0355926Z   BUILD_ENV: build_binary
2025-05-07T20:23:25.0356157Z   BUILD_TARGET: genai
2025-05-07T20:23:25.0356371Z   BUILD_VARIANT: cuda
2025-05-07T20:23:25.0356605Z   BUILD_CUDA_VERSION: 12.6.3
2025-05-07T20:23:25.0356841Z   ENFORCE_CUDA_DEVICE: 1
2025-05-07T20:23:25.0357072Z ##[endgroup]
2025-05-07T20:23:25.2716231Z Downloading single artifact
2025-05-07T20:23:25.3710390Z Preparing to download the following artifacts:
2025-05-07T20:23:25.3711204Z - fbgemm_genai_x86_gcc_py3.13_cu12.6.3.whl (ID: 3081362642, Size: 12512725, Expected Digest: sha256:228c0da92693d2954cf116c01d25e7cc680533513556b331a58d6b7834b2e3d4)
2025-05-07T20:23:25.4211348Z Redirecting to blob download url: https://productionresultssa4.blob.core.windows.net/actions-results/b81c1ade-b872-4473-afc9-b227c140a38f/workflow-job-run-d2ebcb72-c99d-5c1c-9db7-78599d6c6d28/artifacts/4c58965a6bbd4d44222979263dfcdea5bd55f581a5885da24be5168ea14aaaab.zip
2025-05-07T20:23:25.4212744Z Starting download of artifact to: /home/ec2-user/actions-runner/_work/FBGEMM/FBGEMM
2025-05-07T20:23:25.5048221Z (node:245835) [DEP0005] DeprecationWarning: Buffer() is deprecated due to security and usability issues. Please use the Buffer.alloc(), Buffer.allocUnsafe(), or Buffer.from() methods instead.
2025-05-07T20:23:25.5049166Z (Use `node --trace-deprecation ...` to show where the warning was created)
2025-05-07T20:23:25.7225302Z SHA256 digest of downloaded artifact is 228c0da92693d2954cf116c01d25e7cc680533513556b331a58d6b7834b2e3d4
2025-05-07T20:23:25.7225886Z Artifact download completed successfully.
2025-05-07T20:23:25.7226223Z Total of 1 artifact(s) downloaded
2025-05-07T20:23:25.7231699Z Download artifact has finished successfully
2025-05-07T20:23:25.7475366Z ##[group]Run pytorch/test-infra/.github/actions/setup-nvidia@main
2025-05-07T20:23:25.7475743Z with:
2025-05-07T20:23:25.7475963Z   driver-version: 570.133.07
2025-05-07T20:23:25.7476206Z env:
2025-05-07T20:23:25.7476421Z   PRELUDE: .github/scripts/setup_env.bash
2025-05-07T20:23:25.7476710Z   BUILD_ENV: build_binary
2025-05-07T20:23:25.7476942Z   BUILD_TARGET: genai
2025-05-07T20:23:25.7477175Z   BUILD_VARIANT: cuda
2025-05-07T20:23:25.7477399Z   BUILD_CUDA_VERSION: 12.6.3
2025-05-07T20:23:25.7477648Z   ENFORCE_CUDA_DEVICE: 1
2025-05-07T20:23:25.7477881Z ##[endgroup]
2025-05-07T20:23:25.7571929Z ##[group]Run nick-fields/retry@3e91a01664abd3c5cd539100d10d33b9c5b68482
2025-05-07T20:23:25.7572305Z with:
2025-05-07T20:23:25.7572521Z   timeout_minutes: 10
2025-05-07T20:23:25.7572748Z   max_attempts: 3
2025-05-07T20:23:25.7595334Z   command:
    # Is it disgusting to have a full shell script here in this github action? Sure
    # But is it the best way to make it so that this action relies on nothing else? Absolutely
    set -eou pipefail

    DISTRIBUTION=$(. /etc/os-release;echo $ID$VERSION_ID)
    DRIVER_FN="NVIDIA-Linux-x86_64-${DRIVER_VERSION}.run"

    install_nvidia_docker2_amzn2() {
      (
        set -x
        # Needed for yum-config-manager
        sudo yum install -y yum-utils
        if [[ "${DISTRIBUTION}" == "amzn2023" ]] ; then
          YUM_REPO_URL="https://nvidia.github.io/libnvidia-container/stable/rpm/nvidia-container-toolkit.repo"
        else
          # Amazon Linux 2
          YUM_REPO_URL="https://nvidia.github.io/nvidia-docker/${DISTRIBUTION}/nvidia-docker.repo"
        fi
        sudo yum-config-manager --add-repo "${YUM_REPO_URL}"
        sudo yum install -y nvidia-docker2 nvidia-container-toolkit-1.16.2
        sudo systemctl restart docker
      )
    }

    install_nvidia_docker2_ubuntu20() {
      (
        set -x
        # Install nvidia-driver package if not installed
        status="$(dpkg-query -W --showformat='${db:Status-Status}' nvidia-docker2 2>&1)"
        if [ ! $? = 0 ] || [ ! "$status" = installed ]; then
          sudo apt-get install -y nvidia-docker2 nvidia-container-toolkit-1.16.2
          sudo systemctl restart docker
        fi
      )
    }

    pre_install_nvidia_driver_amzn2() {
      (
        # Purge any nvidia driver installed from RHEL repo
        sudo yum remove -y nvidia-driver-latest-dkms
      )
    }

    install_nvidia_driver_common() {
      (
        # Try to gather more information about the runner and its existing NVIDIA driver if any
        echo "Before installing NVIDIA driver"
        lspci
        lsmod
        modinfo nvidia || true

        HAS_NVIDIA_DRIVER=0
        # Check if NVIDIA driver has already been installed
        if [ -x "$(command -v nvidia-smi)" ]; then
          set +e
          # The driver exists; check its version next. Also check only the first GPU
          # if there is more than one, so that the same driver version is not printed
          # over multiple lines
          INSTALLED_DRIVER_VERSION=$(nvidia-smi --query-gpu=driver_version --format=csv,noheader --id=0)
          NVIDIA_SMI_STATUS=$?
          if [ "$NVIDIA_SMI_STATUS" -ne 0 ] && [ "$NVIDIA_SMI_STATUS" -ne 14 ]; then
            echo "Failed to get NVIDIA driver version ($INSTALLED_DRIVER_VERSION). Continuing"
          elif [ "$INSTALLED_DRIVER_VERSION" != "$DRIVER_VERSION" ]; then
            echo "NVIDIA driver ($INSTALLED_DRIVER_VERSION) has been installed, but we expect to have $DRIVER_VERSION instead. Continuing"
            # Turn off persistent mode so that the installation script can unload the kernel module
            sudo killall nvidia-persistenced || true
          else
            HAS_NVIDIA_DRIVER=1
            echo "NVIDIA driver ($INSTALLED_DRIVER_VERSION) has already been installed. Skipping NVIDIA driver installation"
          fi
          set -e
        fi

        if [ "$HAS_NVIDIA_DRIVER" -eq 0 ]; then
          # CAUTION: this may need to be updated in the future
          if [ "${DISTRIBUTION}" != ubuntu20.04 ]; then
            sudo yum groupinstall -y "Development Tools"
            # ensure our kernel install is the same as our underlying kernel,
            # groupinstall "Development Tools" has a habit of mismatching kernel headers
            sudo yum install -y "kernel-devel-uname-r == $(uname -r)"
            sudo modprobe backlight
          fi
          sudo curl -fsL -o /tmp/nvidia_driver "https://s3.amazonaws.com/ossci-linux/nvidia_driver/$DRIVER_FN"

          set +e
          sudo /bin/bash /tmp/nvidia_driver -s --no-drm
          NVIDIA_INSTALLATION_STATUS=$?

          RESET_GPU=0
          if [ "$NVIDIA_INSTALLATION_STATUS" -ne 0 ]; then
            sudo cat /var/log/nvidia-installer.log
            # Failed to install NVIDIA driver, try to reset the GPU
            RESET_GPU=1
          elif [ -x "$(command -v nvidia-smi)" ]; then
            # Check again if nvidia-smi works even if the driver installation completes successfully
            INSTALLED_DRIVER_VERSION=$(nvidia-smi --query-gpu=driver_version --format=csv,noheader --id=0)
            NVIDIA_SMI_STATUS=$?
            if [ "$NVIDIA_SMI_STATUS" -ne 0 ] && [ "$NVIDIA_SMI_STATUS" -ne 14 ]; then
              RESET_GPU=1
            fi
          fi

          if [ "$RESET_GPU" -eq 1 ]; then
            NVIDIA_DEVICES=$(lspci -D | grep -i NVIDIA | cut -d' ' -f1)
            # The GPU can get stuck in a failure state if somehow the test crashes the GPU
            # microcode. When this happens, we'll try to reset all NVIDIA devices
            # https://github.com/pytorch/pytorch/issues/88388
            for PCI_ID in $NVIDIA_DEVICES; do
              DEVICE_ENABLED=$(cat /sys/bus/pci/devices/$PCI_ID/enable)
              echo "Resetting $PCI_ID (enabled state: $DEVICE_ENABLED)"
              # This requires sudo permission of course
              echo "1" | sudo tee /sys/bus/pci/devices/$PCI_ID/reset
              sleep 1
            done
          fi

          sudo rm -fv /tmp/nvidia_driver
          set -e
        fi
      )
    }

    post_install_nvidia_driver_common() {
      (
        sudo modprobe nvidia || true
        echo "After installing NVIDIA driver"
        lspci
        lsmod
        modinfo nvidia || true
        (
          set +e
          nvidia-smi
          # NB: Annoyingly, the nvidia-smi command returns successfully with return code 0 even in
          # the case where the driver has already crashed, as it still can get the driver version
          # and some basic information like the bus ID. However, the rest of the information
          # would be missing (ERR!), for example:
          #
          # +-----------------------------------------------------------------------------+
          # | NVIDIA-SMI 525.89.02    Driver Version: 525.89.02    CUDA Version: 12.0     |
          # |-------------------------------+----------------------+----------------------+
          # | GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
          # | Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
          # |                               |                      |               MIG M. |
          # |===============================+======================+======================|
          # |   0  ERR!                 Off | 00000000:00:1E.0 Off |                 ERR! |
          # |ERR!  ERR! ERR!    ERR! / ERR! |   4184MiB / 23028MiB |    ERR!      Default |
          # |                               |                      |                 ERR! |
          # +-------------------------------+----------------------+----------------------+
          #
          # +-----------------------------------------------------------------------------+
          # | Processes:                                                                  |
          # |  GPU   GI   CI        PID   Type   Process name                  GPU Memory |
          # |        ID   ID                                                   Usage      |
          # |=============================================================================|
          # +-----------------------------------------------------------------------------+
          #
          # This should be reported as a failure instead, as it is guaranteed to fail when
          # Docker tries to run with --gpus all
          #
          # So, the correct check here is to query one of the missing pieces of info like the
          # GPU name, so that the command can fail accordingly
          nvidia-smi --query-gpu=gpu_name --format=csv,noheader --id=0
          NVIDIA_SMI_STATUS=$?
          # Allowable exit statuses for nvidia-smi, see: https://github.com/NVIDIA/gpu-operator/issues/285
          if [ "$NVIDIA_SMI_STATUS" -eq 0 ] || [ "$NVIDIA_SMI_STATUS" -eq 14 ]; then
            echo "INFO: Ignoring allowed status ${NVIDIA_SMI_STATUS}"
          else
            echo "ERROR: nvidia-smi exited with unresolved status ${NVIDIA_SMI_STATUS}"
            exit ${NVIDIA_SMI_STATUS}
          fi
          set -e
        )
      )
    }

    install_nvidia_driver_amzn2() {
      (
        set -x
        pre_install_nvidia_driver_amzn2
        install_nvidia_driver_common
        post_install_nvidia_driver_common
      )
    }

    install_nvidia_driver_ubuntu20() {
      (
        set -x
        install_nvidia_driver_common
        post_install_nvidia_driver_common
      )
    }

    echo "== Installing nvidia driver ${DRIVER_FN} =="
    case "${DISTRIBUTION}" in
      amzn*)
        install_nvidia_driver_amzn2
        ;;
      ubuntu20.04)
        install_nvidia_driver_ubuntu20
        ;;
      *)
        echo "ERROR: Unknown distribution ${DISTRIBUTION}"
        exit 1
        ;;
    esac

    # Install container toolkit based on distribution
    echo "== Installing nvidia container toolkit for ${DISTRIBUTION} =="
    case "${DISTRIBUTION}" in
      amzn*)
        install_nvidia_docker2_amzn2
        ;;
      ubuntu20.04)
        install_nvidia_docker2_ubuntu20
        ;;
      *)
        echo "ERROR: Unknown distribution ${DISTRIBUTION}"
        exit 1
        ;;
    esac

    echo "GPU_FLAG=--gpus all -e NVIDIA_DRIVER_CAPABILITIES=all" >> "${GITHUB_ENV}"

    # Fix https://github.com/NVIDIA/nvidia-docker/issues/1648 on runners with
    # more than one GPU. This just needs to be run once. The command fails
    # on subsequent runs and complains that the mode is already on, but that's
    # ok
    sudo nvidia-persistenced || true
    # This should show persistence mode ON
    nvidia-smi
2025-05-07T20:23:25.7618445Z   retry_wait_seconds: 10
2025-05-07T20:23:25.7618701Z   polling_interval_seconds: 1
2025-05-07T20:23:25.7618950Z   warning_on_retry: true
2025-05-07T20:23:25.7619188Z   continue_on_error: false
2025-05-07T20:23:25.7619428Z env:
2025-05-07T20:23:25.7619635Z   PRELUDE: .github/scripts/setup_env.bash
2025-05-07T20:23:25.7619930Z   BUILD_ENV: build_binary
2025-05-07T20:23:25.7632680Z   BUILD_TARGET: genai
2025-05-07T20:23:25.7632922Z   BUILD_VARIANT: cuda
2025-05-07T20:23:25.7633159Z   BUILD_CUDA_VERSION: 12.6.3
2025-05-07T20:23:25.7633412Z   ENFORCE_CUDA_DEVICE: 1
2025-05-07T20:23:25.7633652Z   DRIVER_VERSION: 570.133.07
2025-05-07T20:23:25.7633884Z ##[endgroup]
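The idempotency of the step above hinges on one check: ask nvidia-smi for the installed driver version, compare it against the pinned DRIVER_VERSION, and treat exit statuses 0 and 14 as acceptable (14 is a known benign status; see the gpu-operator issue linked in the script). A standalone sketch of that check, assuming DRIVER_VERSION is set as in the step env above:

    #!/usr/bin/env bash
    # Sketch: decide whether the pinned NVIDIA driver is already present.
    DRIVER_VERSION="${DRIVER_VERSION:-570.133.07}"
    if command -v nvidia-smi >/dev/null 2>&1; then
      installed=$(nvidia-smi --query-gpu=driver_version --format=csv,noheader --id=0)
      status=$?
      # 0 and 14 are the allowed nvidia-smi exit statuses
      if { [ "$status" -eq 0 ] || [ "$status" -eq 14 ]; } && [ "$installed" = "$DRIVER_VERSION" ]; then
        echo "NVIDIA driver $installed already installed; skipping"
        exit 0
      fi
    fi
    echo "NVIDIA driver installation required"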
2025-05-07T20:23:26.6536998Z == Installing nvidia driver NVIDIA-Linux-x86_64-570.133.07.run ==
2025-05-07T20:23:26.6539237Z + pre_install_nvidia_driver_amzn2
2025-05-07T20:23:26.6539755Z + sudo yum remove -y nvidia-driver-latest-dkms
2025-05-07T20:23:26.9476564Z No match for argument: nvidia-driver-latest-dkms
2025-05-07T20:23:26.9477584Z No packages marked for removal.
2025-05-07T20:23:26.9541023Z Dependencies resolved.
2025-05-07T20:23:26.9550634Z Nothing to do.
2025-05-07T20:23:26.9550984Z Complete!
2025-05-07T20:23:26.9901181Z + install_nvidia_driver_common
2025-05-07T20:23:26.9905047Z + echo 'Before installing NVIDIA driver'
2025-05-07T20:23:26.9905472Z + lspci
2025-05-07T20:23:26.9907736Z Before installing NVIDIA driver
2025-05-07T20:23:27.0027083Z 00:00.0 Host bridge: Intel Corporation 440FX - 82441FX PMC [Natoma]
2025-05-07T20:23:27.0027817Z 00:01.0 ISA bridge: Intel Corporation 82371SB PIIX3 ISA [Natoma/Triton II]
2025-05-07T20:23:27.0028359Z 00:01.3 Non-VGA unclassified device: Intel Corporation 82371AB/EB/MB PIIX4 ACPI (rev 08)
2025-05-07T20:23:27.0028983Z 00:03.0 VGA compatible controller: Amazon.com, Inc. Device 1111
2025-05-07T20:23:27.0029673Z 00:04.0 Non-Volatile memory controller: Amazon.com, Inc. NVMe EBS Controller
2025-05-07T20:23:27.0030439Z 00:05.0 Ethernet controller: Amazon.com, Inc. Elastic Network Adapter (ENA)
2025-05-07T20:23:27.0030926Z 00:1e.0 3D controller: NVIDIA Corporation GA102GL [A10G] (rev a1)
2025-05-07T20:23:27.0031398Z 00:1f.0 Non-Volatile memory controller: Amazon.com, Inc. NVMe SSD Controller
2025-05-07T20:23:27.0031827Z + lsmod
2025-05-07T20:23:27.0074694Z Module                  Size  Used by
2025-05-07T20:23:27.0075300Z xt_nat                 16384  0
2025-05-07T20:23:27.0075812Z nvidia_modeset       1716224  0
2025-05-07T20:23:27.0076355Z video                  65536  1 nvidia_modeset
2025-05-07T20:23:27.0076956Z wmi                    36864  1 video
2025-05-07T20:23:27.0077495Z nvidia_uvm           1884160  0
2025-05-07T20:23:27.0078086Z nvidia              11583488  7 nvidia_uvm,nvidia_modeset
2025-05-07T20:23:27.0078714Z drm                   602112  1 nvidia
2025-05-07T20:23:27.0079306Z drm_panel_orientation_quirks    32768  1 drm
2025-05-07T20:23:27.0080032Z backlight              24576  3 video,drm,nvidia_modeset
2025-05-07T20:23:27.0080705Z i2c_core              110592  2 nvidia,drm
2025-05-07T20:23:27.0081264Z veth                   36864  0
2025-05-07T20:23:27.0081773Z xt_conntrack           16384  1
2025-05-07T20:23:27.0082226Z nft_chain_nat          16384  3
2025-05-07T20:23:27.0082488Z xt_MASQUERADE          20480  1
2025-05-07T20:23:27.0083008Z nf_nat                 57344  3 xt_nat,nft_chain_nat,xt_MASQUERADE
2025-05-07T20:23:27.0083347Z nf_conntrack_netlink    57344  0
2025-05-07T20:23:27.0083772Z nf_conntrack          184320  5 xt_conntrack,nf_nat,xt_nat,nf_conntrack_netlink,xt_MASQUERADE
2025-05-07T20:23:27.0084233Z nf_defrag_ipv6         24576  1 nf_conntrack
2025-05-07T20:23:27.0084548Z nf_defrag_ipv4         16384  1 nf_conntrack
2025-05-07T20:23:27.0084836Z xfrm_user              57344  1
2025-05-07T20:23:27.0085105Z xfrm_algo              16384  1 xfrm_user
2025-05-07T20:23:27.0085397Z xt_addrtype            16384  2
2025-05-07T20:23:27.0085648Z nft_compat             20480  4
2025-05-07T20:23:27.0085951Z nf_tables             311296  57 nft_compat,nft_chain_nat
2025-05-07T20:23:27.0086361Z nfnetlink              20480  4 nft_compat,nf_conntrack_netlink,nf_tables
2025-05-07T20:23:27.0086729Z br_netfilter           36864  0
2025-05-07T20:23:27.0087008Z bridge                323584  1 br_netfilter
2025-05-07T20:23:27.0087309Z stp                    16384  1 bridge
2025-05-07T20:23:27.0087594Z llc                    16384  2 bridge,stp
2025-05-07T20:23:27.0087871Z overlay               167936  0
2025-05-07T20:23:27.0088122Z tls                   135168  0
2025-05-07T20:23:27.0088378Z nls_ascii              16384  1
2025-05-07T20:23:27.0088624Z nls_cp437              20480  1
2025-05-07T20:23:27.0088874Z vfat                   24576  1
2025-05-07T20:23:27.0089125Z fat                    86016  1 vfat
2025-05-07T20:23:27.0089387Z sunrpc                696320  1
2025-05-07T20:23:27.0089640Z ena                   180224  0
2025-05-07T20:23:27.0089881Z i8042                  45056  0
2025-05-07T20:23:27.0090130Z serio                  28672  3 i8042
2025-05-07T20:23:27.0090406Z ghash_clmulni_intel    16384  0
2025-05-07T20:23:27.0090666Z button                 24576  0
2025-05-07T20:23:27.0090915Z sch_fq_codel           20480  17
2025-05-07T20:23:27.0091174Z fuse                  163840  1
2025-05-07T20:23:27.0091426Z dm_mod                188416  0
2025-05-07T20:23:27.0091681Z configfs               57344  1
2025-05-07T20:23:27.0091931Z dax                    45056  1 dm_mod
2025-05-07T20:23:27.0092343Z loop                   36864  0
2025-05-07T20:23:27.0092628Z dmi_sysfs              20480  0
2025-05-07T20:23:27.0092985Z crc32_pclmul           16384  0
2025-05-07T20:23:27.0093240Z crc32c_intel           24576  0
2025-05-07T20:23:27.0093493Z efivarfs               24576  1
2025-05-07T20:23:27.0093880Z + modinfo nvidia
2025-05-07T20:23:27.0094250Z filename: /lib/modules/6.1.130-139.222.amzn2023.x86_64/kernel/drivers/video/nvidia.ko
2025-05-07T20:23:27.0094688Z import_ns: DMA_BUF
2025-05-07T20:23:27.0094939Z alias: char-major-195-*
2025-05-07T20:23:27.0095214Z version: 570.133.07
2025-05-07T20:23:27.0095459Z supported: external
2025-05-07T20:23:27.0095709Z license: Dual MIT/GPL
2025-05-07T20:23:27.0095999Z firmware: nvidia/570.133.07/gsp_tu10x.bin
2025-05-07T20:23:27.0096339Z firmware: nvidia/570.133.07/gsp_ga10x.bin
2025-05-07T20:23:27.0096662Z srcversion: 49515739FD8F721A3F2F714
2025-05-07T20:23:27.0096995Z alias: pci:v000010DEd*sv*sd*bc06sc80i00*
2025-05-07T20:23:27.0097328Z alias: pci:v000010DEd*sv*sd*bc03sc02i00*
2025-05-07T20:23:27.0097667Z alias: pci:v000010DEd*sv*sd*bc03sc00i00*
2025-05-07T20:23:27.0097979Z depends: i2c-core,drm
2025-05-07T20:23:27.0098400Z retpoline: Y
2025-05-07T20:23:27.0098620Z name: nvidia
2025-05-07T20:23:27.0098977Z vermagic: 6.1.130-139.222.amzn2023.x86_64 SMP preempt mod_unload modversions
2025-05-07T20:23:27.0099444Z parm: NvSwitchRegDwords:NvSwitch regkey (charp)
2025-05-07T20:23:27.0099878Z parm: NvSwitchBlacklist:NvSwitchBlacklist=uuid[,uuid...] (charp)
2025-05-07T20:23:27.0100293Z parm: NVreg_ResmanDebugLevel:int
2025-05-07T20:23:27.0100599Z parm: NVreg_RmLogonRC:int
2025-05-07T20:23:27.0100890Z parm: NVreg_ModifyDeviceFiles:int
2025-05-07T20:23:27.0101344Z parm: NVreg_DeviceFileUID:int
2025-05-07T20:23:27.0101647Z parm: NVreg_DeviceFileGID:int
2025-05-07T20:23:27.0101944Z parm: NVreg_DeviceFileMode:int
2025-05-07T20:23:27.0102306Z parm: NVreg_InitializeSystemMemoryAllocations:int
2025-05-07T20:23:27.0102689Z parm: NVreg_UsePageAttributeTable:int
2025-05-07T20:23:27.0103018Z parm: NVreg_EnablePCIeGen3:int
2025-05-07T20:23:27.0103313Z parm: NVreg_EnableMSI:int
2025-05-07T20:23:27.0103619Z parm: NVreg_EnableStreamMemOPs:int
2025-05-07T20:23:27.0103978Z parm: NVreg_RestrictProfilingToAdminUsers:int
2025-05-07T20:23:27.0104364Z parm: NVreg_PreserveVideoMemoryAllocations:int
2025-05-07T20:23:27.0104739Z parm: NVreg_EnableS0ixPowerManagement:int
2025-05-07T20:23:27.0105149Z parm: NVreg_S0ixPowerManagementVideoMemoryThreshold:int
2025-05-07T20:23:27.0105546Z parm: NVreg_DynamicPowerManagement:int
2025-05-07T20:23:27.0105968Z parm: NVreg_DynamicPowerManagementVideoMemoryThreshold:int
2025-05-07T20:23:27.0106371Z parm: NVreg_EnableGpuFirmware:int
2025-05-07T20:23:27.0106710Z parm: NVreg_EnableGpuFirmwareLogs:int
2025-05-07T20:23:27.0107073Z parm: NVreg_OpenRmEnableUnsupportedGpus:int
2025-05-07T20:23:27.0107440Z parm: NVreg_EnableUserNUMAManagement:int
2025-05-07T20:23:27.0107778Z parm: NVreg_MemoryPoolSize:int
2025-05-07T20:23:27.0108089Z parm: NVreg_KMallocHeapMaxSize:int
2025-05-07T20:23:27.0108417Z parm: NVreg_VMallocHeapMaxSize:int
2025-05-07T20:23:27.0108738Z parm: NVreg_IgnoreMMIOCheck:int
2025-05-07T20:23:27.0109039Z parm: NVreg_NvLinkDisable:int
2025-05-07T20:23:27.0109387Z parm: NVreg_EnablePCIERelaxedOrderingMode:int
2025-05-07T20:23:27.0109744Z parm: NVreg_RegisterPCIDriver:int
2025-05-07T20:23:27.0110087Z parm: NVreg_EnableResizableBar:int
2025-05-07T20:23:27.0110412Z parm: NVreg_EnableDbgBreakpoint:int
2025-05-07T20:23:27.0110762Z parm: NVreg_EnableNonblockingOpen:int
2025-05-07T20:23:27.0111095Z parm: NVreg_RegistryDwords:charp
2025-05-07T20:23:27.0111426Z parm: NVreg_RegistryDwordsPerDevice:charp
2025-05-07T20:23:27.0111894Z parm: NVreg_RmMsg:charp
2025-05-07T20:23:27.0112189Z parm: NVreg_GpuBlacklist:charp
2025-05-07T20:23:27.0112503Z parm: NVreg_TemporaryFilePath:charp
2025-05-07T20:23:27.0112824Z parm: NVreg_ExcludedGpus:charp
2025-05-07T20:23:27.0113138Z parm: NVreg_DmaRemapPeerMmio:int
2025-05-07T20:23:27.0113455Z parm: NVreg_RmNvlinkBandwidth:charp
2025-05-07T20:23:27.0113809Z parm: NVreg_RmNvlinkBandwidthLinkCount:int
2025-05-07T20:23:27.0114160Z parm: NVreg_ImexChannelCount:int
2025-05-07T20:23:27.0114481Z parm: NVreg_CreateImexChannel0:int
2025-05-07T20:23:27.0114816Z parm: NVreg_GrdmaPciTopoCheckOverride:int
2025-05-07T20:23:27.0115155Z parm: rm_firmware_active:charp
2025-05-07T20:23:27.0115451Z + HAS_NVIDIA_DRIVER=0
2025-05-07T20:23:27.0115686Z ++ command -v nvidia-smi
2025-05-07T20:23:27.0115944Z + '[' -x /usr/bin/nvidia-smi ']'
2025-05-07T20:23:27.0116200Z + set +e
2025-05-07T20:23:27.0116503Z ++ nvidia-smi --query-gpu=driver_version --format=csv,noheader --id=0
2025-05-07T20:23:27.0344260Z + INSTALLED_DRIVER_VERSION=570.133.07
2025-05-07T20:23:27.0344563Z + NVIDIA_SMI_STATUS=0
2025-05-07T20:23:27.0344791Z + '[' 0 -ne 0 ']'
2025-05-07T20:23:27.0344997Z + '[' 570.133.07 '!=' 570.133.07 ']'
2025-05-07T20:23:27.0345256Z + HAS_NVIDIA_DRIVER=1
2025-05-07T20:23:27.0345666Z + echo 'NVIDIA driver (570.133.07) has already been installed. Skipping NVIDIA driver installation'
2025-05-07T20:23:27.0346109Z + set -e
2025-05-07T20:23:27.0346299Z + '[' 1 -eq 0 ']'
2025-05-07T20:23:27.0346672Z NVIDIA driver (570.133.07) has already been installed. Skipping NVIDIA driver installation
2025-05-07T20:23:27.0347123Z + post_install_nvidia_driver_common
2025-05-07T20:23:27.0350848Z + sudo modprobe nvidia
2025-05-07T20:23:27.1907841Z + echo 'After installing NVIDIA driver'
2025-05-07T20:23:27.1908274Z + lspci
2025-05-07T20:23:27.1908572Z After installing NVIDIA driver
2025-05-07T20:23:27.2023451Z 00:00.0 Host bridge: Intel Corporation 440FX - 82441FX PMC [Natoma]
2025-05-07T20:23:27.2024111Z 00:01.0 ISA bridge: Intel Corporation 82371SB PIIX3 ISA [Natoma/Triton II]
2025-05-07T20:23:27.2024720Z 00:01.3 Non-VGA unclassified device: Intel Corporation 82371AB/EB/MB PIIX4 ACPI (rev 08)
2025-05-07T20:23:27.2025426Z 00:03.0 VGA compatible controller: Amazon.com, Inc. Device 1111
2025-05-07T20:23:27.2026062Z 00:04.0 Non-Volatile memory controller: Amazon.com, Inc. NVMe EBS Controller
2025-05-07T20:23:27.2026605Z 00:05.0 Ethernet controller: Amazon.com, Inc. Elastic Network Adapter (ENA)
2025-05-07T20:23:27.2027078Z 00:1e.0 3D controller: NVIDIA Corporation GA102GL [A10G] (rev a1)
2025-05-07T20:23:27.2027543Z 00:1f.0 Non-Volatile memory controller: Amazon.com, Inc. NVMe SSD Controller
2025-05-07T20:23:27.2027952Z + lsmod
2025-05-07T20:23:27.2057150Z Module                  Size  Used by
2025-05-07T20:23:27.2057590Z xt_nat                 16384  0
2025-05-07T20:23:27.2057931Z nvidia_modeset       1716224  0
2025-05-07T20:23:27.2058320Z video                  65536  1 nvidia_modeset
2025-05-07T20:23:27.2058693Z wmi                    36864  1 video
2025-05-07T20:23:27.2058965Z nvidia_uvm           1884160  0
2025-05-07T20:23:27.2059270Z nvidia              11583488  7 nvidia_uvm,nvidia_modeset
2025-05-07T20:23:27.2059681Z drm                   602112  1 nvidia
2025-05-07T20:23:27.2059989Z drm_panel_orientation_quirks    32768  1 drm
2025-05-07T20:23:27.2060352Z backlight              24576  3 video,drm,nvidia_modeset
2025-05-07T20:23:27.2060702Z i2c_core              110592  2 nvidia,drm
2025-05-07T20:23:27.2060991Z veth                   36864  0
2025-05-07T20:23:27.2061244Z xt_conntrack           16384  1
2025-05-07T20:23:27.2061503Z nft_chain_nat          16384  3
2025-05-07T20:23:27.2061776Z xt_MASQUERADE          20480  1
2025-05-07T20:23:27.2062124Z nf_nat                 57344  3 xt_nat,nft_chain_nat,xt_MASQUERADE
2025-05-07T20:23:27.2062471Z nf_conntrack_netlink    57344  0
2025-05-07T20:23:27.2063477Z nf_conntrack          184320  5 xt_conntrack,nf_nat,xt_nat,nf_conntrack_netlink,xt_MASQUERADE
2025-05-07T20:23:27.2063942Z nf_defrag_ipv6         24576  1 nf_conntrack
2025-05-07T20:23:27.2064249Z nf_defrag_ipv4         16384  1 nf_conntrack
2025-05-07T20:23:27.2064545Z xfrm_user              57344  1
2025-05-07T20:23:27.2064812Z xfrm_algo              16384  1 xfrm_user
2025-05-07T20:23:27.2065093Z xt_addrtype            16384  2
2025-05-07T20:23:27.2065351Z nft_compat             20480  4
2025-05-07T20:23:27.2065652Z nf_tables             311296  57 nft_compat,nft_chain_nat
2025-05-07T20:23:27.2066048Z nfnetlink              20480  4 nft_compat,nf_conntrack_netlink,nf_tables
2025-05-07T20:23:27.2066421Z br_netfilter           36864  0
2025-05-07T20:23:27.2066702Z bridge                323584  1 br_netfilter
2025-05-07T20:23:27.2066999Z stp                    16384  1 bridge
2025-05-07T20:23:27.2067276Z llc                    16384  2 bridge,stp
2025-05-07T20:23:27.2067558Z overlay               167936  0
2025-05-07T20:23:27.2067815Z tls                   135168  0
2025-05-07T20:23:27.2068063Z nls_ascii              16384  1
2025-05-07T20:23:27.2068314Z nls_cp437              20480  1
2025-05-07T20:23:27.2068561Z vfat                   24576  1
2025-05-07T20:23:27.2068807Z fat                    86016  1 vfat
2025-05-07T20:23:27.2069074Z sunrpc                696320  1
2025-05-07T20:23:27.2069325Z ena                   180224  0
2025-05-07T20:23:27.2069571Z i8042                  45056  0
2025-05-07T20:23:27.2069822Z serio                  28672  3 i8042
2025-05-07T20:23:27.2070099Z ghash_clmulni_intel    16384  0
2025-05-07T20:23:27.2070353Z button                 24576  0
2025-05-07T20:23:27.2070606Z sch_fq_codel           20480  17
2025-05-07T20:23:27.2071020Z fuse                  163840  1
2025-05-07T20:23:27.2071269Z dm_mod                188416  0
2025-05-07T20:23:27.2071512Z configfs               57344  1
2025-05-07T20:23:27.2071774Z dax                    45056  1 dm_mod
2025-05-07T20:23:27.2072049Z loop                   36864  0
2025-05-07T20:23:27.2072297Z dmi_sysfs              20480  0
2025-05-07T20:23:27.2072550Z crc32_pclmul           16384  0
2025-05-07T20:23:27.2072807Z crc32c_intel           24576  0
2025-05-07T20:23:27.2073053Z efivarfs               24576  1
2025-05-07T20:23:27.2073302Z + modinfo nvidia
2025-05-07T20:23:27.2073925Z filename: /lib/modules/6.1.130-139.222.amzn2023.x86_64/kernel/drivers/video/nvidia.ko
2025-05-07T20:23:27.2074423Z import_ns: DMA_BUF
2025-05-07T20:23:27.2074684Z alias: char-major-195-*
2025-05-07T20:23:27.2074960Z version: 570.133.07
2025-05-07T20:23:27.2075211Z supported: external
2025-05-07T20:23:27.2075460Z license: Dual MIT/GPL
2025-05-07T20:23:27.2075749Z firmware: nvidia/570.133.07/gsp_tu10x.bin
2025-05-07T20:23:27.2076093Z firmware: nvidia/570.133.07/gsp_ga10x.bin
2025-05-07T20:23:27.2076406Z srcversion: 49515739FD8F721A3F2F714
2025-05-07T20:23:27.2076726Z alias: pci:v000010DEd*sv*sd*bc06sc80i00*
2025-05-07T20:23:27.2077067Z alias: pci:v000010DEd*sv*sd*bc03sc02i00*
2025-05-07T20:23:27.2077398Z alias: pci:v000010DEd*sv*sd*bc03sc00i00*
2025-05-07T20:23:27.2077699Z depends: i2c-core,drm
2025-05-07T20:23:27.2077957Z retpoline: Y
2025-05-07T20:23:27.2078175Z name: nvidia
2025-05-07T20:23:27.2078521Z vermagic: 6.1.130-139.222.amzn2023.x86_64 SMP preempt mod_unload modversions
2025-05-07T20:23:27.2078989Z parm: NvSwitchRegDwords:NvSwitch regkey (charp)
2025-05-07T20:23:27.2079424Z parm: NvSwitchBlacklist:NvSwitchBlacklist=uuid[,uuid...] (charp)
2025-05-07T20:23:27.2079834Z parm: NVreg_ResmanDebugLevel:int
2025-05-07T20:23:27.2080140Z parm: NVreg_RmLogonRC:int
2025-05-07T20:23:27.2080443Z parm: NVreg_ModifyDeviceFiles:int
2025-05-07T20:23:27.2080755Z parm: NVreg_DeviceFileUID:int
2025-05-07T20:23:27.2081046Z parm: NVreg_DeviceFileGID:int
2025-05-07T20:23:27.2081346Z parm: NVreg_DeviceFileMode:int
2025-05-07T20:23:27.2081812Z parm: NVreg_InitializeSystemMemoryAllocations:int
2025-05-07T20:23:27.2082192Z parm: NVreg_UsePageAttributeTable:int
2025-05-07T20:23:27.2082521Z parm: NVreg_EnablePCIeGen3:int
2025-05-07T20:23:27.2082818Z parm: NVreg_EnableMSI:int
2025-05-07T20:23:27.2083115Z parm: NVreg_EnableStreamMemOPs:int
2025-05-07T20:23:27.2083474Z parm: NVreg_RestrictProfilingToAdminUsers:int
2025-05-07T20:23:27.2083867Z parm: NVreg_PreserveVideoMemoryAllocations:int
2025-05-07T20:23:27.2084239Z parm: NVreg_EnableS0ixPowerManagement:int
2025-05-07T20:23:27.2084649Z parm: NVreg_S0ixPowerManagementVideoMemoryThreshold:int
2025-05-07T20:23:27.2085052Z parm: NVreg_DynamicPowerManagement:int
2025-05-07T20:23:27.2085470Z parm: NVreg_DynamicPowerManagementVideoMemoryThreshold:int
2025-05-07T20:23:27.2085870Z parm: NVreg_EnableGpuFirmware:int
2025-05-07T20:23:27.2086206Z parm: NVreg_EnableGpuFirmwareLogs:int
2025-05-07T20:23:27.2086571Z parm: NVreg_OpenRmEnableUnsupportedGpus:int
2025-05-07T20:23:27.2086935Z parm: NVreg_EnableUserNUMAManagement:int
2025-05-07T20:23:27.2087278Z parm: NVreg_MemoryPoolSize:int
2025-05-07T20:23:27.2087600Z parm: NVreg_KMallocHeapMaxSize:int
2025-05-07T20:23:27.2087921Z parm: NVreg_VMallocHeapMaxSize:int
2025-05-07T20:23:27.2088244Z parm: NVreg_IgnoreMMIOCheck:int
2025-05-07T20:23:27.2088552Z parm: NVreg_NvLinkDisable:int
2025-05-07T20:23:27.2088898Z parm: NVreg_EnablePCIERelaxedOrderingMode:int
2025-05-07T20:23:27.2089251Z parm: NVreg_RegisterPCIDriver:int
2025-05-07T20:23:27.2089581Z parm: NVreg_EnableResizableBar:int
2025-05-07T20:23:27.2091282Z parm: NVreg_EnableDbgBreakpoint:int
2025-05-07T20:23:27.2091620Z parm: NVreg_EnableNonblockingOpen:int
2025-05-07T20:23:27.2091961Z parm: NVreg_RegistryDwords:charp
2025-05-07T20:23:27.2092306Z parm: NVreg_RegistryDwordsPerDevice:charp
2025-05-07T20:23:27.2092632Z parm: NVreg_RmMsg:charp
2025-05-07T20:23:27.2092920Z parm: NVreg_GpuBlacklist:charp
2025-05-07T20:23:27.2093245Z parm: NVreg_TemporaryFilePath:charp
2025-05-07T20:23:27.2093568Z parm: NVreg_ExcludedGpus:charp
2025-05-07T20:23:27.2093986Z parm: NVreg_DmaRemapPeerMmio:int
2025-05-07T20:23:27.2094314Z parm: NVreg_RmNvlinkBandwidth:charp
2025-05-07T20:23:27.2094674Z parm: NVreg_RmNvlinkBandwidthLinkCount:int
2025-05-07T20:23:27.2095015Z parm: NVreg_ImexChannelCount:int
2025-05-07T20:23:27.2095341Z parm: NVreg_CreateImexChannel0:int
2025-05-07T20:23:27.2095690Z parm: NVreg_GrdmaPciTopoCheckOverride:int
2025-05-07T20:23:27.2096033Z parm: rm_firmware_active:charp
2025-05-07T20:23:27.2096318Z + set +e
2025-05-07T20:23:27.2096513Z + nvidia-smi
2025-05-07T20:23:27.2253971Z Wed May  7 20:23:27 2025
2025-05-07T20:23:27.2254761Z +-----------------------------------------------------------------------------------------+
2025-05-07T20:23:27.2255932Z | NVIDIA-SMI 570.133.07             Driver Version: 570.133.07     CUDA Version: 12.8     |
2025-05-07T20:23:27.2256879Z |-----------------------------------------+------------------------+----------------------+
2025-05-07T20:23:27.2257843Z | GPU  Name                 Persistence-M | Bus-Id          Disp.A | Volatile Uncorr. ECC |
2025-05-07T20:23:27.2258870Z | Fan  Temp   Perf          Pwr:Usage/Cap |           Memory-Usage | GPU-Util  Compute M. |
2025-05-07T20:23:27.2259709Z |                                         |                        |               MIG M. |
2025-05-07T20:23:27.2260372Z |=========================================+========================+======================|
2025-05-07T20:23:27.2430004Z |   0  NVIDIA A10G                    On  |   00000000:00:1E.0 Off |                    0 |
2025-05-07T20:23:27.2430753Z |  0%   29C    P8             24W /  300W |       0MiB /  23028MiB |      0%      Default |
2025-05-07T20:23:27.2431140Z |                                         |                        |                  N/A |
2025-05-07T20:23:27.2431534Z +-----------------------------------------+------------------------+----------------------+
2025-05-07T20:23:27.2435164Z
2025-05-07T20:23:27.2436086Z +-----------------------------------------------------------------------------------------+
2025-05-07T20:23:27.2436856Z | Processes:                                                                              |
2025-05-07T20:23:27.2437665Z |  GPU   GI   CI              PID   Type   Process name                        GPU Memory |
2025-05-07T20:23:27.2438423Z |        ID   ID                                                               Usage      |
2025-05-07T20:23:27.2439059Z |=========================================================================================|
2025-05-07T20:23:27.2440664Z |  No running processes found                                                             |
2025-05-07T20:23:27.2441640Z +-----------------------------------------------------------------------------------------+
2025-05-07T20:23:27.4793662Z + nvidia-smi --query-gpu=gpu_name --format=csv,noheader --id=0
2025-05-07T20:23:27.4965001Z NVIDIA A10G
2025-05-07T20:23:27.5008033Z + NVIDIA_SMI_STATUS=0
2025-05-07T20:23:27.5008370Z + '[' 0 -eq 0 ']'
2025-05-07T20:23:27.5008700Z + echo 'INFO: Ignoring allowed status 0'
2025-05-07T20:23:27.5009001Z + set -e
2025-05-07T20:23:27.5009218Z INFO: Ignoring allowed status 0
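With the pinned driver confirmed, the step proceeds to the container toolkit and finally exports GPU_FLAG=--gpus all -e NVIDIA_DRIVER_CAPABILITIES=all into GITHUB_ENV (it is visible in the env of the later steps). A sketch of how a subsequent step might consume that flag; the CUDA image tag here is illustrative, not taken from this workflow:

    # Hypothetical follow-up step: run a GPU smoke test inside a container.
    # GPU_FLAG is left unquoted on purpose so it expands into separate docker arguments.
    docker run --rm ${GPU_FLAG} nvidia/cuda:12.6.3-base-ubuntu22.04 nvidia-smi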
2025-05-07T20:23:27.5066081Z == Installing nvidia container toolkit for amzn2023 ==
2025-05-07T20:23:27.5066641Z + sudo yum install -y yum-utils
2025-05-07T20:23:27.9091326Z Last metadata expiration check: 0:54:00 ago on Wed May  7 19:29:27 2025.
2025-05-07T20:23:27.9340418Z Package dnf-utils-4.3.0-13.amzn2023.0.5.noarch is already installed.
2025-05-07T20:23:27.9739076Z Dependencies resolved.
2025-05-07T20:23:27.9922128Z Nothing to do.
2025-05-07T20:23:27.9922461Z Complete!
2025-05-07T20:23:28.0321296Z + [[ amzn2023 == \a\m\z\n\2\0\2\3 ]]
2025-05-07T20:23:28.0321892Z + YUM_REPO_URL=https://nvidia.github.io/libnvidia-container/stable/rpm/nvidia-container-toolkit.repo
2025-05-07T20:23:28.0323513Z + sudo yum-config-manager --add-repo https://nvidia.github.io/libnvidia-container/stable/rpm/nvidia-container-toolkit.repo
2025-05-07T20:23:28.2798730Z Adding repo from: https://nvidia.github.io/libnvidia-container/stable/rpm/nvidia-container-toolkit.repo
2025-05-07T20:23:28.3385484Z + sudo yum install -y nvidia-docker2 nvidia-container-toolkit-1.16.2
2025-05-07T20:23:28.8724774Z nvidia-container-toolkit                         13 kB/s | 833 B   00:00
2025-05-07T20:23:28.8984166Z Package nvidia-docker2-2.14.0-1.noarch is already installed.
2025-05-07T20:23:28.9389718Z Dependencies resolved.
2025-05-07T20:23:28.9568070Z ================================================================================
2025-05-07T20:23:28.9568585Z  Package                        Arch    Version   Repository                Size
2025-05-07T20:23:28.9568962Z ================================================================================
2025-05-07T20:23:28.9569271Z Downgrading:
2025-05-07T20:23:28.9569634Z  nvidia-container-toolkit       x86_64  1.16.2-1  nvidia-container-toolkit  1.2 M
2025-05-07T20:23:28.9570206Z  nvidia-container-toolkit-base  x86_64  1.16.2-1  nvidia-container-toolkit  5.6 M
2025-05-07T20:23:28.9570566Z
2025-05-07T20:23:28.9570661Z Transaction Summary
2025-05-07T20:23:28.9570912Z ================================================================================
2025-05-07T20:23:28.9571216Z Downgrade  2 Packages
2025-05-07T20:23:28.9571364Z
2025-05-07T20:23:28.9571481Z Total download size: 6.8 M
2025-05-07T20:23:28.9572246Z Downloading Packages:
2025-05-07T20:23:29.0426014Z (1/2): nvidia-container-toolkit-1.16.2-1.x86_64  15 MB/s | 1.2 MB  00:00
2025-05-07T20:23:29.1416959Z (2/2): nvidia-container-toolkit-base-1.16.2-1.x  31 MB/s | 5.6 MB  00:00
2025-05-07T20:23:29.1425496Z --------------------------------------------------------------------------------
2025-05-07T20:23:29.1428374Z Total                                            37 MB/s | 6.8 MB  00:00
2025-05-07T20:23:29.1431372Z Running transaction check
2025-05-07T20:23:29.1536494Z Transaction check succeeded.
2025-05-07T20:23:29.1537243Z Running transaction test
2025-05-07T20:23:29.1833965Z Transaction test succeeded.
2025-05-07T20:23:29.1836330Z Running transaction
2025-05-07T20:23:29.7347750Z   Preparing        :                                                        1/1
2025-05-07T20:23:29.8421602Z   Downgrading      : nvidia-container-toolkit-base-1.16.2-1.x86_64          1/4
2025-05-07T20:23:29.8459128Z   Downgrading      : nvidia-container-toolkit-1.16.2-1.x86_64               2/4
2025-05-07T20:23:29.8690538Z   Running scriptlet: nvidia-container-toolkit-1.16.2-1.x86_64               2/4
2025-05-07T20:23:29.8691299Z   Cleanup          : nvidia-container-toolkit-1.17.6-1.x86_64               3/4
2025-05-07T20:23:29.8803517Z   Running scriptlet: nvidia-container-toolkit-1.17.6-1.x86_64               3/4
2025-05-07T20:23:29.8831517Z   Cleanup          : nvidia-container-toolkit-base-1.17.6-1.x86_64          4/4
2025-05-07T20:23:30.0740001Z   Running scriptlet: nvidia-container-toolkit-1.16.2-1.x86_64               4/4
2025-05-07T20:23:30.0741155Z   Verifying        : nvidia-container-toolkit-1.16.2-1.x86_64               1/4
2025-05-07T20:23:30.0742216Z   Verifying        : nvidia-container-toolkit-1.17.6-1.x86_64               2/4
2025-05-07T20:23:30.0742991Z   Verifying        : nvidia-container-toolkit-base-1.16.2-1.x86_64          3/4
2025-05-07T20:23:30.2079504Z   Verifying        : nvidia-container-toolkit-base-1.17.6-1.x86_64          4/4
2025-05-07T20:23:30.2079504Z ================================================================================
2025-05-07T20:23:30.2080070Z WARNING:
2025-05-07T20:23:30.2080315Z   A newer release of "Amazon Linux" is available.
2025-05-07T20:23:30.2080572Z
2025-05-07T20:23:30.2080674Z   Available Versions:
2025-05-07T20:23:30.2080821Z
2025-05-07T20:23:30.2080917Z   Version 2023.7.20250331:
2025-05-07T20:23:30.2081228Z     Run the following command to upgrade to 2023.7.20250331:
2025-05-07T20:23:30.2081482Z
2025-05-07T20:23:30.2081606Z       dnf upgrade --releasever=2023.7.20250331
2025-05-07T20:23:30.2081815Z
2025-05-07T20:23:30.2081906Z     Release notes:
2025-05-07T20:23:30.2082314Z       https://docs.aws.amazon.com/linux/al2023/release-notes/relnotes-2023.7.20250331.html
2025-05-07T20:23:30.2082696Z
2025-05-07T20:23:30.2082806Z   Version 2023.7.20250414:
2025-05-07T20:23:30.2083137Z     Run the following command to upgrade to 2023.7.20250414:
2025-05-07T20:23:30.2083391Z
2025-05-07T20:23:30.2083513Z       dnf upgrade --releasever=2023.7.20250414
2025-05-07T20:23:30.2083717Z
2025-05-07T20:23:30.2083804Z     Release notes:
2025-05-07T20:23:30.2084199Z       https://docs.aws.amazon.com/linux/al2023/release-notes/relnotes-2023.7.20250414.html
2025-05-07T20:23:30.2084555Z
2025-05-07T20:23:30.2084656Z   Version 2023.7.20250428:
2025-05-07T20:23:30.2084954Z     Run the following command to upgrade to 2023.7.20250428:
2025-05-07T20:23:30.2085206Z
2025-05-07T20:23:30.2085321Z       dnf upgrade --releasever=2023.7.20250428
2025-05-07T20:23:30.2085533Z
2025-05-07T20:23:30.2085617Z     Release notes:
2025-05-07T20:23:30.2086004Z       https://docs.aws.amazon.com/linux/al2023/release-notes/relnotes-2023.7.20250428.html
2025-05-07T20:23:30.2086355Z
2025-05-07T20:23:30.2086469Z ================================================================================
2025-05-07T20:23:30.2444197Z
2025-05-07T20:23:30.2444348Z
2025-05-07T20:23:30.2444434Z Downgraded:
2025-05-07T20:23:30.2444814Z   nvidia-container-toolkit-1.16.2-1.x86_64
2025-05-07T20:23:30.2445378Z   nvidia-container-toolkit-base-1.16.2-1.x86_64
2025-05-07T20:23:30.2445717Z
2025-05-07T20:23:30.2445800Z Complete!
2025-05-07T20:23:30.2899990Z + sudo systemctl restart docker
2025-05-07T20:23:33.2674923Z nvidia-persistenced failed to initialize. Check syslog for more details.
2025-05-07T20:23:33.2872266Z Wed May  7 20:23:33 2025
2025-05-07T20:23:33.2873026Z +-----------------------------------------------------------------------------------------+
2025-05-07T20:23:33.2873701Z | NVIDIA-SMI 570.133.07             Driver Version: 570.133.07     CUDA Version: 12.8     |
2025-05-07T20:23:33.2874168Z |-----------------------------------------+------------------------+----------------------+
2025-05-07T20:23:33.2874650Z | GPU  Name                 Persistence-M | Bus-Id          Disp.A | Volatile Uncorr. ECC |
2025-05-07T20:23:33.2875161Z | Fan  Temp   Perf          Pwr:Usage/Cap |           Memory-Usage | GPU-Util  Compute M. |
2025-05-07T20:23:33.2875611Z |                                         |                        |               MIG M. |
2025-05-07T20:23:33.2875941Z |=========================================+========================+======================|
2025-05-07T20:23:33.3008124Z |   0  NVIDIA A10G                    On  |   00000000:00:1E.0 Off |                    0 |
2025-05-07T20:23:33.3008563Z |  0%   29C    P8             24W /  300W |       0MiB /  23028MiB |      0%      Default |
2025-05-07T20:23:33.3008937Z |                                         |                        |                  N/A |
2025-05-07T20:23:33.3009320Z +-----------------------------------------+------------------------+----------------------+
2025-05-07T20:23:33.3011591Z
2025-05-07T20:23:33.3011997Z +-----------------------------------------------------------------------------------------+
2025-05-07T20:23:33.3012817Z | Processes:                                                                              |
2025-05-07T20:23:33.3013243Z |  GPU   GI   CI              PID   Type   Process name                        GPU Memory |
2025-05-07T20:23:33.3013794Z |        ID   ID                                                               Usage      |
2025-05-07T20:23:33.3014137Z |=========================================================================================|
2025-05-07T20:23:33.3017770Z |  No running processes found                                                             |
2025-05-07T20:23:33.3018237Z +-----------------------------------------------------------------------------------------+
2025-05-07T20:23:33.8155590Z Command completed after 1 attempt(s).
2025-05-07T20:23:33.8242280Z ##[group]Run . $PRELUDE; print_system_info; print_ec2_info
2025-05-07T20:23:33.8242719Z . $PRELUDE; print_system_info; print_ec2_info
2025-05-07T20:23:33.8256347Z shell: /usr/bin/bash --noprofile --norc -e -o pipefail {0}
2025-05-07T20:23:33.8256707Z env:
2025-05-07T20:23:33.8256932Z   PRELUDE: .github/scripts/setup_env.bash
2025-05-07T20:23:33.8257232Z   BUILD_ENV: build_binary
2025-05-07T20:23:33.8257478Z   BUILD_TARGET: genai
2025-05-07T20:23:33.8257711Z   BUILD_VARIANT: cuda
2025-05-07T20:23:33.8257940Z   BUILD_CUDA_VERSION: 12.6.3
2025-05-07T20:23:33.8258195Z   ENFORCE_CUDA_DEVICE: 1
2025-05-07T20:23:33.8258498Z   GPU_FLAG: --gpus all -e NVIDIA_DRIVER_CAPABILITIES=all
2025-05-07T20:23:33.8258821Z ##[endgroup]
2025-05-07T20:23:34.1626961Z ################################################################################
2025-05-07T20:23:34.1627423Z # Print System Info
2025-05-07T20:23:34.1627711Z #
2025-05-07T20:23:34.1644260Z # [2025-05-07T20:23:34.164Z] + print_system_info
2025-05-07T20:23:34.1644608Z ################################################################################
2025-05-07T20:23:34.1644823Z
2025-05-07T20:23:34.1644937Z ################################################################################
2025-05-07T20:23:34.1645272Z [INFO] Printing environment variables ...
2025-05-07T20:23:34.1645567Z + printenv
2025-05-07T20:23:34.1645723Z
2025-05-07T20:23:34.1655151Z SHELL=/bin/bash
2025-05-07T20:23:34.1655495Z GITHUB_WORKSPACE=/home/ec2-user/actions-runner/_work/FBGEMM/FBGEMM
2025-05-07T20:23:34.1655910Z BUILD_VARIANT=cuda
2025-05-07T20:23:34.1656424Z GITHUB_PATH=/home/ec2-user/actions-runner/_work/_temp/_runner_file_commands/add_path_78ba5815-fdf5-45ec-beb7-0271d86c1f0b
2025-05-07T20:23:34.1656985Z GITHUB_ACTION=__run
2025-05-07T20:23:34.1657264Z GPU_FLAG=--gpus all -e NVIDIA_DRIVER_CAPABILITIES=all
2025-05-07T20:23:34.1657595Z GITHUB_RUN_NUMBER=10601
2025-05-07T20:23:34.1657841Z RUNNER_NAME=i-0b68a33264ad7b273
2025-05-07T20:23:34.1658115Z GITHUB_REPOSITORY_OWNER_ID=21003710
2025-05-07T20:23:34.1658418Z PLATFORM_NAME_LC=linux-x86_64
2025-05-07T20:23:34.1658678Z MACHINE_NAME_LC=x86_64
2025-05-07T20:23:34.1659033Z ACTIONS_RUNNER_HOOK_JOB_COMPLETED=/home/ec2-user/runner-scripts/after_job.sh
2025-05-07T20:23:34.1659453Z GITHUB_TRIGGERING_ACTOR=q10
2025-05-07T20:23:34.1659730Z PRELUDE=.github/scripts/setup_env.bash
2025-05-07T20:23:34.1660013Z GITHUB_REF_TYPE=branch
2025-05-07T20:23:34.1660457Z ***
2025-05-07T20:23:34.1660651Z LOGNAME=ec2-user
2025-05-07T20:23:34.1660879Z GITHUB_REPOSITORY_ID=150154628
2025-05-07T20:23:34.1661129Z ENFORCE_CUDA_DEVICE=1
2025-05-07T20:23:34.1661356Z GITHUB_ACTIONS=true
2025-05-07T20:23:34.1661572Z SYSTEMD_EXEC_PID=55528
2025-05-07T20:23:34.1661837Z GITHUB_SHA=a2f4c52051596e74bc8c16e3d2867a4ecdd271e0
2025-05-07T20:23:34.1662372Z GITHUB_WORKFLOW_REF=pytorch/FBGEMM/.github/workflows/fbgemm_gpu_ci_cuda.yml@refs/pull/4066/merge
2025-05-07T20:23:34.1662873Z RUNNER_ENVIRONMENT=self-hosted
2025-05-07T20:23:34.1663139Z GITHUB_REF=refs/pull/4066/merge
2025-05-07T20:23:34.1663392Z RUNNER_OS=Linux
2025-05-07T20:23:34.1663611Z GITHUB_REF_PROTECTED=false
2025-05-07T20:23:34.1663845Z HOME=/home/ec2-user
2025-05-07T20:23:34.1664095Z GITHUB_API_URL=https://api.github.com
2025-05-07T20:23:34.1664948Z LANG=C.UTF-8
2025-05-07T20:23:34.1665699Z RUNNER_TRACKING_ID=github_85c37a8c-042b-4f5a-98d5-bf97741633f7
2025-05-07T20:23:34.1666043Z RUNNER_ARCH=X64
2025-05-07T20:23:34.1666313Z RUNNER_TEMP=/home/ec2-user/actions-runner/_work/_temp
2025-05-07T20:23:34.1666632Z BUILD_TARGET=genai
2025-05-07T20:23:34.1667139Z GITHUB_STATE=/home/ec2-user/actions-runner/_work/_temp/_runner_file_commands/save_state_78ba5815-fdf5-45ec-beb7-0271d86c1f0b
2025-05-07T20:23:34.1667983Z GITHUB_ENV=/home/ec2-user/actions-runner/_work/_temp/_runner_file_commands/set_env_78ba5815-fdf5-45ec-beb7-0271d86c1f0b
2025-05-07T20:23:34.1668698Z GITHUB_EVENT_PATH=/home/ec2-user/actions-runner/_work/_temp/_github_workflow/event.json
2025-05-07T20:23:34.1669349Z INVOCATION_ID=95a11ac3c71b4f0f87a09cc23f2e742b
2025-05-07T20:23:34.1669665Z GITHUB_EVENT_NAME=pull_request
2025-05-07T20:23:34.1669922Z GITHUB_RUN_ID=14891846252
2025-05-07T20:23:34.1670486Z GITHUB_STEP_SUMMARY=/home/ec2-user/actions-runner/_work/_temp/_runner_file_commands/step_summary_78ba5815-fdf5-45ec-beb7-0271d86c1f0b
2025-05-07T20:23:34.1671081Z BUILD_ENV=build_binary
2025-05-07T20:23:34.1671305Z GITHUB_ACTOR=q10
2025-05-07T20:23:34.1671518Z GITHUB_RUN_ATTEMPT=1
2025-05-07T20:23:34.1671735Z KERN_NAME_LC=linux
2025-05-07T20:23:34.1671955Z BUILD_CUDA_VERSION=12.6.3
2025-05-07T20:23:34.1672251Z GITHUB_GRAPHQL_URL=https://api.github.com/graphql
2025-05-07T20:23:34.1672580Z PLATFORM_NAME=Linux-x86_64
2025-05-07T20:23:34.1672820Z USER=ec2-user
2025-05-07T20:23:34.1673049Z GITHUB_SERVER_URL=https://github.com
2025-05-07T20:23:34.1673316Z SHLVL=1 2025-05-07T20:23:34.1673512Z GITHUB_ACTOR_ID=255046 2025-05-07T20:23:34.1673841Z RUNNER_TOOL_CACHE=/home/ec2-user/actions-runner/_work/_tool 2025-05-07T20:23:34.1674296Z GITHUB_WORKFLOW_SHA=6060cd4b5f971680caecdcc657faccb5720d1c3e 2025-05-07T20:23:34.1674639Z GITHUB_REF_NAME=4066/merge 2025-05-07T20:23:34.1674876Z KERN_NAME=Linux 2025-05-07T20:23:34.1675103Z GITHUB_JOB=test_and_publish_artifact 2025-05-07T20:23:34.1675498Z ACTIONS_RUNNER_HOOK_JOB_STARTED=/home/ec2-user/runner-scripts/before_job.sh 2025-05-07T20:23:34.1675914Z GITHUB_REPOSITORY=pytorch/FBGEMM 2025-05-07T20:23:34.1676183Z GITHUB_RETENTION_DAYS=90 2025-05-07T20:23:34.1676414Z JOURNAL_STREAM=8:84460 2025-05-07T20:23:34.1676723Z RUNNER_WORKSPACE=/home/ec2-user/actions-runner/_work/FBGEMM 2025-05-07T20:23:34.1677082Z GITHUB_ACTION_REPOSITORY= 2025-05-07T20:23:34.1677380Z PATH=/usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin 2025-05-07T20:23:34.1677703Z GITHUB_BASE_REF=main 2025-05-07T20:23:34.1677922Z CI=true 2025-05-07T20:23:34.1678122Z GITHUB_REPOSITORY_OWNER=pytorch 2025-05-07T20:23:34.1678402Z GITHUB_HEAD_REF=bm/genai-rocm-oss-6 2025-05-07T20:23:34.1678674Z GITHUB_ACTION_REF= 2025-05-07T20:23:34.1678921Z GITHUB_WORKFLOW=FBGEMM GPU/GenAI CUDA CI 2025-05-07T20:23:34.1679508Z GITHUB_OUTPUT=/home/ec2-user/actions-runner/_work/_temp/_runner_file_commands/set_output_78ba5815-fdf5-45ec-beb7-0271d86c1f0b 2025-05-07T20:23:34.1680080Z MACHINE_NAME=x86_64 2025-05-07T20:23:34.1680340Z _=/usr/bin/printenv 2025-05-07T20:23:34.1680481Z 2025-05-07T20:23:34.1680598Z ################################################################################ 2025-05-07T20:23:34.1680910Z [INFO] Print ldd version ... 2025-05-07T20:23:34.1681160Z + ldd --version 2025-05-07T20:23:34.1681283Z 2025-05-07T20:23:34.1681373Z ldd (GNU libc) 2.34 2025-05-07T20:23:34.1681626Z Copyright (C) 2021 Free Software Foundation, Inc. 2025-05-07T20:23:34.1682055Z This is free software; see the source for copying conditions. There is NO 2025-05-07T20:23:34.1682604Z warranty; not even for MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. 2025-05-07T20:23:34.1683031Z Written by Roland McGrath and Ulrich Drepper. 2025-05-07T20:23:34.1683249Z 2025-05-07T20:23:34.1683366Z ################################################################################ 2025-05-07T20:23:34.1683668Z [INFO] Print CPU info ... 
2025-05-07T20:23:34.1683902Z + nproc 2025-05-07T20:23:34.1684007Z 2025-05-07T20:23:34.1686952Z 16 2025-05-07T20:23:34.1688586Z 2025-05-07T20:23:34.1688753Z + lscpu 2025-05-07T20:23:34.1688860Z 2025-05-07T20:23:34.1761769Z Architecture: x86_64 2025-05-07T20:23:34.1762474Z CPU op-mode(s): 32-bit, 64-bit 2025-05-07T20:23:34.1763225Z Address sizes: 48 bits physical, 48 bits virtual 2025-05-07T20:23:34.1763876Z Byte Order: Little Endian 2025-05-07T20:23:34.1764223Z CPU(s): 16 2025-05-07T20:23:34.1764520Z On-line CPU(s) list: 0-15 2025-05-07T20:23:34.1764832Z Vendor ID: AuthenticAMD 2025-05-07T20:23:34.1765165Z Model name: AMD EPYC 7R32 2025-05-07T20:23:34.1765474Z CPU family: 23 2025-05-07T20:23:34.1765989Z Model: 49 2025-05-07T20:23:34.1766279Z Thread(s) per core: 2 2025-05-07T20:23:34.1766564Z Core(s) per socket: 8 2025-05-07T20:23:34.1766844Z Socket(s): 1 2025-05-07T20:23:34.1767120Z Stepping: 0 2025-05-07T20:23:34.1767418Z BogoMIPS: 5599.85 2025-05-07T20:23:34.1769431Z Flags: fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush mmx fxsr sse sse2 ht syscall nx mmxext fxsr_opt pdpe1gb rdtscp lm constant_tsc rep_good nopl nonstop_tsc cpuid extd_apicid aperfmperf tsc_known_freq pni pclmulqdq ssse3 fma cx16 sse4_1 sse4_2 movbe popcnt aes xsave avx f16c rdrand hypervisor lahf_lm cmp_legacy cr8_legacy abm sse4a misalignsse 3dnowprefetch topoext ssbd ibrs ibpb stibp vmmcall fsgsbase bmi1 avx2 smep bmi2 rdseed adx smap clflushopt clwb sha_ni xsaveopt xsavec xgetbv1 clzero xsaveerptr rdpru wbnoinvd arat npt nrip_save rdpid 2025-05-07T20:23:34.1771427Z Hypervisor vendor: KVM 2025-05-07T20:23:34.1771732Z Virtualization type: full 2025-05-07T20:23:34.1772073Z L1d cache: 256 KiB (8 instances) 2025-05-07T20:23:34.1772431Z L1i cache: 256 KiB (8 instances) 2025-05-07T20:23:34.1772781Z L2 cache: 4 MiB (8 instances) 2025-05-07T20:23:34.1773121Z L3 cache: 32 MiB (2 instances) 2025-05-07T20:23:34.1773438Z NUMA node(s): 1 2025-05-07T20:23:34.1773833Z NUMA node0 CPU(s): 0-15 2025-05-07T20:23:34.1774199Z Vulnerability Gather data sampling: Not affected 2025-05-07T20:23:34.1774570Z Vulnerability Itlb multihit: Not affected 2025-05-07T20:23:34.1774924Z Vulnerability L1tf: Not affected 2025-05-07T20:23:34.1775261Z Vulnerability Mds: Not affected 2025-05-07T20:23:34.1775614Z Vulnerability Meltdown: Not affected 2025-05-07T20:23:34.1775969Z Vulnerability Mmio stale data: Not affected 2025-05-07T20:23:34.1776327Z Vulnerability Reg file data sampling: Not affected 2025-05-07T20:23:34.1776857Z Vulnerability Retbleed: Mitigation; untrained return thunk; SMT enabled with STIBP protection 2025-05-07T20:23:34.1777423Z Vulnerability Spec rstack overflow: Mitigation; safe RET 2025-05-07T20:23:34.1777953Z Vulnerability Spec store bypass: Mitigation; Speculative Store Bypass disabled via prctl 2025-05-07T20:23:34.1778614Z Vulnerability Spectre v1: Mitigation; usercopy/swapgs barriers and __user pointer sanitization 2025-05-07T20:23:34.1779486Z Vulnerability Spectre v2: Mitigation; Retpolines; IBPB conditional; STIBP always-on; RSB filling; PBRSB-eIBRS Not affected; BHI Not affected 2025-05-07T20:23:34.1780179Z Vulnerability Srbds: Not affected 2025-05-07T20:23:34.1780540Z Vulnerability Tsx async abort: Not affected 2025-05-07T20:23:34.1780886Z 2025-05-07T20:23:34.1781016Z + cat /proc/cpuinfo 2025-05-07T20:23:34.1781217Z 2025-05-07T20:23:34.1781333Z processor : 0 2025-05-07T20:23:34.1781629Z vendor_id : AuthenticAMD 2025-05-07T20:23:34.1782144Z cpu family : 23 2025-05-07T20:23:34.1782415Z model : 49 
2025-05-07T20:23:34.1782698Z model name : AMD EPYC 7R32 2025-05-07T20:23:34.1783016Z stepping : 0 2025-05-07T20:23:34.1783224Z microcode : 0x830107f 2025-05-07T20:23:34.1783450Z cpu MHz : 3298.326 2025-05-07T20:23:34.1783666Z cache size : 512 KB 2025-05-07T20:23:34.1783876Z physical id : 0 2025-05-07T20:23:34.1784104Z siblings : 16 2025-05-07T20:23:34.1784339Z core id : 0 2025-05-07T20:23:34.1784534Z cpu cores : 8 2025-05-07T20:23:34.1784736Z apicid : 0 2025-05-07T20:23:34.1784935Z initial apicid : 0 2025-05-07T20:23:34.1785144Z fpu : yes 2025-05-07T20:23:34.1785343Z fpu_exception : yes 2025-05-07T20:23:34.1785560Z cpuid level : 13 2025-05-07T20:23:34.1785765Z wp : yes 2025-05-07T20:23:34.1787827Z flags : fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush mmx fxsr sse sse2 ht syscall nx mmxext fxsr_opt pdpe1gb rdtscp lm constant_tsc rep_good nopl nonstop_tsc cpuid extd_apicid aperfmperf tsc_known_freq pni pclmulqdq ssse3 fma cx16 sse4_1 sse4_2 movbe popcnt aes xsave avx f16c rdrand hypervisor lahf_lm cmp_legacy cr8_legacy abm sse4a misalignsse 3dnowprefetch topoext ssbd ibrs ibpb stibp vmmcall fsgsbase bmi1 avx2 smep bmi2 rdseed adx smap clflushopt clwb sha_ni xsaveopt xsavec xgetbv1 clzero xsaveerptr rdpru wbnoinvd arat npt nrip_save rdpid 2025-05-07T20:23:34.1790008Z bugs : sysret_ss_attrs null_seg spectre_v1 spectre_v2 spec_store_bypass retbleed smt_rsb srso ibpb_no_ret 2025-05-07T20:23:34.1790493Z bogomips : 5599.85 2025-05-07T20:23:34.1790716Z TLB size : 3072 4K pages 2025-05-07T20:23:34.1790947Z clflush size : 64 2025-05-07T20:23:34.1791167Z cache_alignment : 64 2025-05-07T20:23:34.1791438Z address sizes : 48 bits physical, 48 bits virtual 2025-05-07T20:23:34.1791759Z power management: 2025-05-07T20:23:34.1791902Z 2025-05-07T20:23:34.1791985Z processor : 1 2025-05-07T20:23:34.1792207Z vendor_id : AuthenticAMD 2025-05-07T20:23:34.1792439Z cpu family : 23 2025-05-07T20:23:34.1792654Z model : 49 2025-05-07T20:23:34.1792861Z model name : AMD EPYC 7R32 2025-05-07T20:23:34.1793096Z stepping : 0 2025-05-07T20:23:34.1793322Z microcode : 0x830107f 2025-05-07T20:23:34.1804126Z cpu MHz : 3307.447 2025-05-07T20:23:34.1804361Z cache size : 512 KB 2025-05-07T20:23:34.1804585Z physical id : 0 2025-05-07T20:23:34.1804801Z siblings : 16 2025-05-07T20:23:34.1804998Z core id : 1 2025-05-07T20:23:34.1805204Z cpu cores : 8 2025-05-07T20:23:34.1805408Z apicid : 2 2025-05-07T20:23:34.1805607Z initial apicid : 2 2025-05-07T20:23:34.1805828Z fpu : yes 2025-05-07T20:23:34.1806039Z fpu_exception : yes 2025-05-07T20:23:34.1806256Z cpuid level : 13 2025-05-07T20:23:34.1806469Z wp : yes 2025-05-07T20:23:34.1808386Z flags : fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush mmx fxsr sse sse2 ht syscall nx mmxext fxsr_opt pdpe1gb rdtscp lm constant_tsc rep_good nopl nonstop_tsc cpuid extd_apicid aperfmperf tsc_known_freq pni pclmulqdq ssse3 fma cx16 sse4_1 sse4_2 movbe popcnt aes xsave avx f16c rdrand hypervisor lahf_lm cmp_legacy cr8_legacy abm sse4a misalignsse 3dnowprefetch topoext ssbd ibrs ibpb stibp vmmcall fsgsbase bmi1 avx2 smep bmi2 rdseed adx smap clflushopt clwb sha_ni xsaveopt xsavec xgetbv1 clzero xsaveerptr rdpru wbnoinvd arat npt nrip_save rdpid 2025-05-07T20:23:34.1810561Z bugs : sysret_ss_attrs null_seg spectre_v1 spectre_v2 spec_store_bypass retbleed smt_rsb srso ibpb_no_ret 2025-05-07T20:23:34.1811046Z bogomips : 5599.85 2025-05-07T20:23:34.1811268Z TLB size : 3072 4K pages 2025-05-07T20:23:34.1811510Z clflush size : 64 
2025-05-07T20:23:34.1811735Z cache_alignment : 64 2025-05-07T20:23:34.1812001Z address sizes : 48 bits physical, 48 bits virtual 2025-05-07T20:23:34.1812321Z power management: 2025-05-07T20:23:34.1812454Z 2025-05-07T20:23:34.1812553Z processor : 2 2025-05-07T20:23:34.1812770Z vendor_id : AuthenticAMD 2025-05-07T20:23:34.1813018Z cpu family : 23 2025-05-07T20:23:34.1813236Z model : 49 2025-05-07T20:23:34.1813442Z model name : AMD EPYC 7R32 2025-05-07T20:23:34.1814064Z stepping : 0 2025-05-07T20:23:34.1814306Z microcode : 0x830107f 2025-05-07T20:23:34.1814557Z cpu MHz : 3301.782 2025-05-07T20:23:34.1814769Z cache size : 512 KB 2025-05-07T20:23:34.1814987Z physical id : 0 2025-05-07T20:23:34.1815206Z siblings : 16 2025-05-07T20:23:34.1815405Z core id : 2 2025-05-07T20:23:34.1815608Z cpu cores : 8 2025-05-07T20:23:34.1815813Z apicid : 4 2025-05-07T20:23:34.1816010Z initial apicid : 4 2025-05-07T20:23:34.1816229Z fpu : yes 2025-05-07T20:23:34.1816433Z fpu_exception : yes 2025-05-07T20:23:34.1816646Z cpuid level : 13 2025-05-07T20:23:34.1816859Z wp : yes 2025-05-07T20:23:34.1818894Z flags : fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush mmx fxsr sse sse2 ht syscall nx mmxext fxsr_opt pdpe1gb rdtscp lm constant_tsc rep_good nopl nonstop_tsc cpuid extd_apicid aperfmperf tsc_known_freq pni pclmulqdq ssse3 fma cx16 sse4_1 sse4_2 movbe popcnt aes xsave avx f16c rdrand hypervisor lahf_lm cmp_legacy cr8_legacy abm sse4a misalignsse 3dnowprefetch topoext ssbd ibrs ibpb stibp vmmcall fsgsbase bmi1 avx2 smep bmi2 rdseed adx smap clflushopt clwb sha_ni xsaveopt xsavec xgetbv1 clzero xsaveerptr rdpru wbnoinvd arat npt nrip_save rdpid 2025-05-07T20:23:34.1821062Z bugs : sysret_ss_attrs null_seg spectre_v1 spectre_v2 spec_store_bypass retbleed smt_rsb srso ibpb_no_ret 2025-05-07T20:23:34.1821538Z bogomips : 5599.85 2025-05-07T20:23:34.1821766Z TLB size : 3072 4K pages 2025-05-07T20:23:34.1822008Z clflush size : 64 2025-05-07T20:23:34.1822223Z cache_alignment : 64 2025-05-07T20:23:34.1822496Z address sizes : 48 bits physical, 48 bits virtual 2025-05-07T20:23:34.1822813Z power management: 2025-05-07T20:23:34.1822947Z 2025-05-07T20:23:34.1823044Z processor : 3 2025-05-07T20:23:34.1823259Z vendor_id : AuthenticAMD 2025-05-07T20:23:34.1823504Z cpu family : 23 2025-05-07T20:23:34.1823717Z model : 49 2025-05-07T20:23:34.1823920Z model name : AMD EPYC 7R32 2025-05-07T20:23:34.1824164Z stepping : 0 2025-05-07T20:23:34.1824380Z microcode : 0x830107f 2025-05-07T20:23:34.1824608Z cpu MHz : 3299.453 2025-05-07T20:23:34.1824828Z cache size : 512 KB 2025-05-07T20:23:34.1825045Z physical id : 0 2025-05-07T20:23:34.1825250Z siblings : 16 2025-05-07T20:23:34.1825452Z core id : 3 2025-05-07T20:23:34.1825656Z cpu cores : 8 2025-05-07T20:23:34.1825851Z apicid : 6 2025-05-07T20:23:34.1826051Z initial apicid : 6 2025-05-07T20:23:34.1826264Z fpu : yes 2025-05-07T20:23:34.1826458Z fpu_exception : yes 2025-05-07T20:23:34.1826677Z cpuid level : 13 2025-05-07T20:23:34.1826886Z wp : yes 2025-05-07T20:23:34.1828779Z flags : fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush mmx fxsr sse sse2 ht syscall nx mmxext fxsr_opt pdpe1gb rdtscp lm constant_tsc rep_good nopl nonstop_tsc cpuid extd_apicid aperfmperf tsc_known_freq pni pclmulqdq ssse3 fma cx16 sse4_1 sse4_2 movbe popcnt aes xsave avx f16c rdrand hypervisor lahf_lm cmp_legacy cr8_legacy abm sse4a misalignsse 3dnowprefetch topoext ssbd ibrs ibpb stibp vmmcall fsgsbase bmi1 avx2 smep bmi2 rdseed adx smap clflushopt clwb 
sha_ni xsaveopt xsavec xgetbv1 clzero xsaveerptr rdpru wbnoinvd arat npt nrip_save rdpid 2025-05-07T20:23:34.1830934Z bugs : sysret_ss_attrs null_seg spectre_v1 spectre_v2 spec_store_bypass retbleed smt_rsb srso ibpb_no_ret 2025-05-07T20:23:34.1831417Z bogomips : 5599.85 2025-05-07T20:23:34.1831646Z TLB size : 3072 4K pages 2025-05-07T20:23:34.1831889Z clflush size : 64 2025-05-07T20:23:34.1832103Z cache_alignment : 64 2025-05-07T20:23:34.1832377Z address sizes : 48 bits physical, 48 bits virtual 2025-05-07T20:23:34.1832694Z power management: 2025-05-07T20:23:34.1832824Z 2025-05-07T20:23:34.1832908Z processor : 4 2025-05-07T20:23:34.1833130Z vendor_id : AuthenticAMD 2025-05-07T20:23:34.1833371Z cpu family : 23 2025-05-07T20:23:34.1833574Z model : 49 2025-05-07T20:23:34.1833792Z model name : AMD EPYC 7R32 2025-05-07T20:23:34.1834037Z stepping : 0 2025-05-07T20:23:34.1834240Z microcode : 0x830107f 2025-05-07T20:23:34.1834471Z cpu MHz : 3296.391 2025-05-07T20:23:34.1834786Z cache size : 512 KB 2025-05-07T20:23:34.1835002Z physical id : 0 2025-05-07T20:23:34.1835214Z siblings : 16 2025-05-07T20:23:34.1835424Z core id : 4 2025-05-07T20:23:34.1835621Z cpu cores : 8 2025-05-07T20:23:34.1835824Z apicid : 8 2025-05-07T20:23:34.1836029Z initial apicid : 8 2025-05-07T20:23:34.1836237Z fpu : yes 2025-05-07T20:23:34.1836501Z fpu_exception : yes 2025-05-07T20:23:34.1836730Z cpuid level : 13 2025-05-07T20:23:34.1836943Z wp : yes 2025-05-07T20:23:34.1838907Z flags : fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush mmx fxsr sse sse2 ht syscall nx mmxext fxsr_opt pdpe1gb rdtscp lm constant_tsc rep_good nopl nonstop_tsc cpuid extd_apicid aperfmperf tsc_known_freq pni pclmulqdq ssse3 fma cx16 sse4_1 sse4_2 movbe popcnt aes xsave avx f16c rdrand hypervisor lahf_lm cmp_legacy cr8_legacy abm sse4a misalignsse 3dnowprefetch topoext ssbd ibrs ibpb stibp vmmcall fsgsbase bmi1 avx2 smep bmi2 rdseed adx smap clflushopt clwb sha_ni xsaveopt xsavec xgetbv1 clzero xsaveerptr rdpru wbnoinvd arat npt nrip_save rdpid 2025-05-07T20:23:34.1841066Z bugs : sysret_ss_attrs null_seg spectre_v1 spectre_v2 spec_store_bypass retbleed smt_rsb srso ibpb_no_ret 2025-05-07T20:23:34.1841544Z bogomips : 5599.85 2025-05-07T20:23:34.1841769Z TLB size : 3072 4K pages 2025-05-07T20:23:34.1841993Z clflush size : 64 2025-05-07T20:23:34.1842210Z cache_alignment : 64 2025-05-07T20:23:34.1842476Z address sizes : 48 bits physical, 48 bits virtual 2025-05-07T20:23:34.1842782Z power management: 2025-05-07T20:23:34.1842921Z 2025-05-07T20:23:34.1843004Z processor : 5 2025-05-07T20:23:34.1843217Z vendor_id : AuthenticAMD 2025-05-07T20:23:34.1843453Z cpu family : 23 2025-05-07T20:23:34.1843651Z model : 49 2025-05-07T20:23:34.1843853Z model name : AMD EPYC 7R32 2025-05-07T20:23:34.1844096Z stepping : 0 2025-05-07T20:23:34.1844296Z microcode : 0x830107f 2025-05-07T20:23:34.1844517Z cpu MHz : 3286.612 2025-05-07T20:23:34.1844729Z cache size : 512 KB 2025-05-07T20:23:34.1844935Z physical id : 0 2025-05-07T20:23:34.1845145Z siblings : 16 2025-05-07T20:23:34.1845343Z core id : 5 2025-05-07T20:23:34.1845532Z cpu cores : 8 2025-05-07T20:23:34.1845733Z apicid : 10 2025-05-07T20:23:34.1845936Z initial apicid : 10 2025-05-07T20:23:34.1846143Z fpu : yes 2025-05-07T20:23:34.1846343Z fpu_exception : yes 2025-05-07T20:23:34.1846565Z cpuid level : 13 2025-05-07T20:23:34.1846766Z wp : yes 2025-05-07T20:23:34.1848662Z flags : fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush mmx fxsr sse sse2 ht syscall nx 
mmxext fxsr_opt pdpe1gb rdtscp lm constant_tsc rep_good nopl nonstop_tsc cpuid extd_apicid aperfmperf tsc_known_freq pni pclmulqdq ssse3 fma cx16 sse4_1 sse4_2 movbe popcnt aes xsave avx f16c rdrand hypervisor lahf_lm cmp_legacy cr8_legacy abm sse4a misalignsse 3dnowprefetch topoext ssbd ibrs ibpb stibp vmmcall fsgsbase bmi1 avx2 smep bmi2 rdseed adx smap clflushopt clwb sha_ni xsaveopt xsavec xgetbv1 clzero xsaveerptr rdpru wbnoinvd arat npt nrip_save rdpid 2025-05-07T20:23:34.1850817Z bugs : sysret_ss_attrs null_seg spectre_v1 spectre_v2 spec_store_bypass retbleed smt_rsb srso ibpb_no_ret 2025-05-07T20:23:34.1851300Z bogomips : 5599.85 2025-05-07T20:23:34.1851512Z TLB size : 3072 4K pages 2025-05-07T20:23:34.1851751Z clflush size : 64 2025-05-07T20:23:34.1851971Z cache_alignment : 64 2025-05-07T20:23:34.1852243Z address sizes : 48 bits physical, 48 bits virtual 2025-05-07T20:23:34.1852548Z power management: 2025-05-07T20:23:34.1852687Z 2025-05-07T20:23:34.1852769Z processor : 6 2025-05-07T20:23:34.1852983Z vendor_id : AuthenticAMD 2025-05-07T20:23:34.1853212Z cpu family : 23 2025-05-07T20:23:34.1853422Z model : 49 2025-05-07T20:23:34.1853626Z model name : AMD EPYC 7R32 2025-05-07T20:23:34.1853939Z stepping : 0 2025-05-07T20:23:34.1854148Z microcode : 0x830107f 2025-05-07T20:23:34.1854376Z cpu MHz : 3290.744 2025-05-07T20:23:34.1854584Z cache size : 512 KB 2025-05-07T20:23:34.1854805Z physical id : 0 2025-05-07T20:23:34.1855019Z siblings : 16 2025-05-07T20:23:34.1855212Z core id : 6 2025-05-07T20:23:34.1855500Z cpu cores : 8 2025-05-07T20:23:34.1855701Z apicid : 12 2025-05-07T20:23:34.1855901Z initial apicid : 12 2025-05-07T20:23:34.1856111Z fpu : yes 2025-05-07T20:23:34.1856312Z fpu_exception : yes 2025-05-07T20:23:34.1856523Z cpuid level : 13 2025-05-07T20:23:34.1856733Z wp : yes 2025-05-07T20:23:34.1858735Z flags : fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush mmx fxsr sse sse2 ht syscall nx mmxext fxsr_opt pdpe1gb rdtscp lm constant_tsc rep_good nopl nonstop_tsc cpuid extd_apicid aperfmperf tsc_known_freq pni pclmulqdq ssse3 fma cx16 sse4_1 sse4_2 movbe popcnt aes xsave avx f16c rdrand hypervisor lahf_lm cmp_legacy cr8_legacy abm sse4a misalignsse 3dnowprefetch topoext ssbd ibrs ibpb stibp vmmcall fsgsbase bmi1 avx2 smep bmi2 rdseed adx smap clflushopt clwb sha_ni xsaveopt xsavec xgetbv1 clzero xsaveerptr rdpru wbnoinvd arat npt nrip_save rdpid 2025-05-07T20:23:34.1861012Z bugs : sysret_ss_attrs null_seg spectre_v1 spectre_v2 spec_store_bypass retbleed smt_rsb srso ibpb_no_ret 2025-05-07T20:23:34.1861501Z bogomips : 5599.85 2025-05-07T20:23:34.1861711Z TLB size : 3072 4K pages 2025-05-07T20:23:34.1861947Z clflush size : 64 2025-05-07T20:23:34.1862160Z cache_alignment : 64 2025-05-07T20:23:34.1862418Z address sizes : 48 bits physical, 48 bits virtual 2025-05-07T20:23:34.1862730Z power management: 2025-05-07T20:23:34.1862858Z 2025-05-07T20:23:34.1862947Z processor : 7 2025-05-07T20:23:34.1863153Z vendor_id : AuthenticAMD 2025-05-07T20:23:34.1863386Z cpu family : 23 2025-05-07T20:23:34.1863592Z model : 49 2025-05-07T20:23:34.1863808Z model name : AMD EPYC 7R32 2025-05-07T20:23:34.1864076Z stepping : 0 2025-05-07T20:23:34.1864282Z microcode : 0x830107f 2025-05-07T20:23:34.1864500Z cpu MHz : 3294.699 2025-05-07T20:23:34.1864722Z cache size : 512 KB 2025-05-07T20:23:34.1864939Z physical id : 0 2025-05-07T20:23:34.1865147Z siblings : 16 2025-05-07T20:23:34.1865351Z core id : 7 2025-05-07T20:23:34.1865550Z cpu cores : 8 2025-05-07T20:23:34.1865746Z apicid : 
14 2025-05-07T20:23:34.1865953Z initial apicid : 14 2025-05-07T20:23:34.1866161Z fpu : yes 2025-05-07T20:23:34.1866350Z fpu_exception : yes 2025-05-07T20:23:34.1866559Z cpuid level : 13 2025-05-07T20:23:34.1866764Z wp : yes 2025-05-07T20:23:34.1868653Z flags : fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush mmx fxsr sse sse2 ht syscall nx mmxext fxsr_opt pdpe1gb rdtscp lm constant_tsc rep_good nopl nonstop_tsc cpuid extd_apicid aperfmperf tsc_known_freq pni pclmulqdq ssse3 fma cx16 sse4_1 sse4_2 movbe popcnt aes xsave avx f16c rdrand hypervisor lahf_lm cmp_legacy cr8_legacy abm sse4a misalignsse 3dnowprefetch topoext ssbd ibrs ibpb stibp vmmcall fsgsbase bmi1 avx2 smep bmi2 rdseed adx smap clflushopt clwb sha_ni xsaveopt xsavec xgetbv1 clzero xsaveerptr rdpru wbnoinvd arat npt nrip_save rdpid 2025-05-07T20:23:34.1870800Z bugs : sysret_ss_attrs null_seg spectre_v1 spectre_v2 spec_store_bypass retbleed smt_rsb srso ibpb_no_ret 2025-05-07T20:23:34.1871277Z bogomips : 5599.85 2025-05-07T20:23:34.1871492Z TLB size : 3072 4K pages 2025-05-07T20:23:34.1871724Z clflush size : 64 2025-05-07T20:23:34.1871939Z cache_alignment : 64 2025-05-07T20:23:34.1872196Z address sizes : 48 bits physical, 48 bits virtual 2025-05-07T20:23:34.1872505Z power management: 2025-05-07T20:23:34.1872633Z 2025-05-07T20:23:34.1872723Z processor : 8 2025-05-07T20:23:34.1872927Z vendor_id : AuthenticAMD 2025-05-07T20:23:34.1873163Z cpu family : 23 2025-05-07T20:23:34.1873366Z model : 49 2025-05-07T20:23:34.1873564Z model name : AMD EPYC 7R32 2025-05-07T20:23:34.1873802Z stepping : 0 2025-05-07T20:23:34.1874012Z microcode : 0x830107f 2025-05-07T20:23:34.1874229Z cpu MHz : 3298.834 2025-05-07T20:23:34.1874440Z cache size : 512 KB 2025-05-07T20:23:34.1874651Z physical id : 0 2025-05-07T20:23:34.1874853Z siblings : 16 2025-05-07T20:23:34.1875052Z core id : 0 2025-05-07T20:23:34.1875251Z cpu cores : 8 2025-05-07T20:23:34.1875442Z apicid : 1 2025-05-07T20:23:34.1875637Z initial apicid : 1 2025-05-07T20:23:34.1875933Z fpu : yes 2025-05-07T20:23:34.1876121Z fpu_exception : yes 2025-05-07T20:23:34.1876334Z cpuid level : 13 2025-05-07T20:23:34.1876540Z wp : yes 2025-05-07T20:23:34.1878422Z flags : fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush mmx fxsr sse sse2 ht syscall nx mmxext fxsr_opt pdpe1gb rdtscp lm constant_tsc rep_good nopl nonstop_tsc cpuid extd_apicid aperfmperf tsc_known_freq pni pclmulqdq ssse3 fma cx16 sse4_1 sse4_2 movbe popcnt aes xsave avx f16c rdrand hypervisor lahf_lm cmp_legacy cr8_legacy abm sse4a misalignsse 3dnowprefetch topoext ssbd ibrs ibpb stibp vmmcall fsgsbase bmi1 avx2 smep bmi2 rdseed adx smap clflushopt clwb sha_ni xsaveopt xsavec xgetbv1 clzero xsaveerptr rdpru wbnoinvd arat npt nrip_save rdpid 2025-05-07T20:23:34.1880655Z bugs : sysret_ss_attrs null_seg spectre_v1 spectre_v2 spec_store_bypass retbleed smt_rsb srso ibpb_no_ret 2025-05-07T20:23:34.1881124Z bogomips : 5599.85 2025-05-07T20:23:34.1881343Z TLB size : 3072 4K pages 2025-05-07T20:23:34.1881581Z clflush size : 64 2025-05-07T20:23:34.1881789Z cache_alignment : 64 2025-05-07T20:23:34.1882054Z address sizes : 48 bits physical, 48 bits virtual 2025-05-07T20:23:34.1882361Z power management: 2025-05-07T20:23:34.1882487Z 2025-05-07T20:23:34.1882571Z processor : 9 2025-05-07T20:23:34.1882781Z vendor_id : AuthenticAMD 2025-05-07T20:23:34.1883013Z cpu family : 23 2025-05-07T20:23:34.1883209Z model : 49 2025-05-07T20:23:34.1883411Z model name : AMD EPYC 7R32 2025-05-07T20:23:34.1883645Z 
stepping : 0 2025-05-07T20:23:34.1883843Z microcode : 0x830107f 2025-05-07T20:23:34.1884069Z cpu MHz : 3283.912 2025-05-07T20:23:34.1884276Z cache size : 512 KB 2025-05-07T20:23:34.1884492Z physical id : 0 2025-05-07T20:23:34.1884691Z siblings : 16 2025-05-07T20:23:34.1884886Z core id : 1 2025-05-07T20:23:34.1885086Z cpu cores : 8 2025-05-07T20:23:34.1885281Z apicid : 3 2025-05-07T20:23:34.1885477Z initial apicid : 3 2025-05-07T20:23:34.1885690Z fpu : yes 2025-05-07T20:23:34.1885880Z fpu_exception : yes 2025-05-07T20:23:34.1886100Z cpuid level : 13 2025-05-07T20:23:34.1886305Z wp : yes 2025-05-07T20:23:34.1888185Z flags : fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush mmx fxsr sse sse2 ht syscall nx mmxext fxsr_opt pdpe1gb rdtscp lm constant_tsc rep_good nopl nonstop_tsc cpuid extd_apicid aperfmperf tsc_known_freq pni pclmulqdq ssse3 fma cx16 sse4_1 sse4_2 movbe popcnt aes xsave avx f16c rdrand hypervisor lahf_lm cmp_legacy cr8_legacy abm sse4a misalignsse 3dnowprefetch topoext ssbd ibrs ibpb stibp vmmcall fsgsbase bmi1 avx2 smep bmi2 rdseed adx smap clflushopt clwb sha_ni xsaveopt xsavec xgetbv1 clzero xsaveerptr rdpru wbnoinvd arat npt nrip_save rdpid 2025-05-07T20:23:34.1890338Z bugs : sysret_ss_attrs null_seg spectre_v1 spectre_v2 spec_store_bypass retbleed smt_rsb srso ibpb_no_ret 2025-05-07T20:23:34.1890820Z bogomips : 5599.85 2025-05-07T20:23:34.1891038Z TLB size : 3072 4K pages 2025-05-07T20:23:34.1891266Z clflush size : 64 2025-05-07T20:23:34.1891480Z cache_alignment : 64 2025-05-07T20:23:34.1891750Z address sizes : 48 bits physical, 48 bits virtual 2025-05-07T20:23:34.1892060Z power management: 2025-05-07T20:23:34.1892188Z 2025-05-07T20:23:34.1892274Z processor : 10 2025-05-07T20:23:34.1892487Z vendor_id : AuthenticAMD 2025-05-07T20:23:34.1892722Z cpu family : 23 2025-05-07T20:23:34.1892922Z model : 49 2025-05-07T20:23:34.1893126Z model name : AMD EPYC 7R32 2025-05-07T20:23:34.1893396Z stepping : 0 2025-05-07T20:23:34.1893607Z microcode : 0x830107f 2025-05-07T20:23:34.1893948Z cpu MHz : 3299.820 2025-05-07T20:23:34.1894171Z cache size : 512 KB 2025-05-07T20:23:34.1894379Z physical id : 0 2025-05-07T20:23:34.1894583Z siblings : 16 2025-05-07T20:23:34.1894774Z core id : 2 2025-05-07T20:23:34.1894972Z cpu cores : 8 2025-05-07T20:23:34.1895167Z apicid : 5 2025-05-07T20:23:34.1895360Z initial apicid : 5 2025-05-07T20:23:34.1895571Z fpu : yes 2025-05-07T20:23:34.1895767Z fpu_exception : yes 2025-05-07T20:23:34.1895974Z cpuid level : 13 2025-05-07T20:23:34.1896268Z wp : yes 2025-05-07T20:23:34.1898153Z flags : fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush mmx fxsr sse sse2 ht syscall nx mmxext fxsr_opt pdpe1gb rdtscp lm constant_tsc rep_good nopl nonstop_tsc cpuid extd_apicid aperfmperf tsc_known_freq pni pclmulqdq ssse3 fma cx16 sse4_1 sse4_2 movbe popcnt aes xsave avx f16c rdrand hypervisor lahf_lm cmp_legacy cr8_legacy abm sse4a misalignsse 3dnowprefetch topoext ssbd ibrs ibpb stibp vmmcall fsgsbase bmi1 avx2 smep bmi2 rdseed adx smap clflushopt clwb sha_ni xsaveopt xsavec xgetbv1 clzero xsaveerptr rdpru wbnoinvd arat npt nrip_save rdpid 2025-05-07T20:23:34.1900721Z bugs : sysret_ss_attrs null_seg spectre_v1 spectre_v2 spec_store_bypass retbleed smt_rsb srso ibpb_no_ret 2025-05-07T20:23:34.1901188Z bogomips : 5599.85 2025-05-07T20:23:34.1901554Z TLB size : 3072 4K pages 2025-05-07T20:23:34.1901791Z clflush size : 64 2025-05-07T20:23:34.1901999Z cache_alignment : 64 2025-05-07T20:23:34.1902261Z address sizes : 48 bits 
physical, 48 bits virtual 2025-05-07T20:23:34.1902579Z power management: 2025-05-07T20:23:34.1902706Z 2025-05-07T20:23:34.1902796Z processor : 11 2025-05-07T20:23:34.1903002Z vendor_id : AuthenticAMD 2025-05-07T20:23:34.1903234Z cpu family : 23 2025-05-07T20:23:34.1903438Z model : 49 2025-05-07T20:23:34.1903636Z model name : AMD EPYC 7R32 2025-05-07T20:23:34.1903882Z stepping : 0 2025-05-07T20:23:34.1904089Z microcode : 0x830107f 2025-05-07T20:23:34.1904332Z cpu MHz : 3301.322 2025-05-07T20:23:34.1904569Z cache size : 512 KB 2025-05-07T20:23:34.1904780Z physical id : 0 2025-05-07T20:23:34.1904984Z siblings : 16 2025-05-07T20:23:34.1905185Z core id : 3 2025-05-07T20:23:34.1905381Z cpu cores : 8 2025-05-07T20:23:34.1905573Z apicid : 7 2025-05-07T20:23:34.1905769Z initial apicid : 7 2025-05-07T20:23:34.1905983Z fpu : yes 2025-05-07T20:23:34.1906174Z fpu_exception : yes 2025-05-07T20:23:34.1906391Z cpuid level : 13 2025-05-07T20:23:34.1906598Z wp : yes 2025-05-07T20:23:34.1908484Z flags : fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush mmx fxsr sse sse2 ht syscall nx mmxext fxsr_opt pdpe1gb rdtscp lm constant_tsc rep_good nopl nonstop_tsc cpuid extd_apicid aperfmperf tsc_known_freq pni pclmulqdq ssse3 fma cx16 sse4_1 sse4_2 movbe popcnt aes xsave avx f16c rdrand hypervisor lahf_lm cmp_legacy cr8_legacy abm sse4a misalignsse 3dnowprefetch topoext ssbd ibrs ibpb stibp vmmcall fsgsbase bmi1 avx2 smep bmi2 rdseed adx smap clflushopt clwb sha_ni xsaveopt xsavec xgetbv1 clzero xsaveerptr rdpru wbnoinvd arat npt nrip_save rdpid 2025-05-07T20:23:34.1910633Z bugs : sysret_ss_attrs null_seg spectre_v1 spectre_v2 spec_store_bypass retbleed smt_rsb srso ibpb_no_ret 2025-05-07T20:23:34.1911110Z bogomips : 5599.85 2025-05-07T20:23:34.1911328Z TLB size : 3072 4K pages 2025-05-07T20:23:34.1911567Z clflush size : 64 2025-05-07T20:23:34.1911775Z cache_alignment : 64 2025-05-07T20:23:34.1912039Z address sizes : 48 bits physical, 48 bits virtual 2025-05-07T20:23:34.1912348Z power management: 2025-05-07T20:23:34.1912480Z 2025-05-07T20:23:34.1912562Z processor : 12 2025-05-07T20:23:34.1912776Z vendor_id : AuthenticAMD 2025-05-07T20:23:34.1913010Z cpu family : 23 2025-05-07T20:23:34.1913207Z model : 49 2025-05-07T20:23:34.1913415Z model name : AMD EPYC 7R32 2025-05-07T20:23:34.1913653Z stepping : 0 2025-05-07T20:23:34.1913876Z microcode : 0x830107f 2025-05-07T20:23:34.1914123Z cpu MHz : 3300.495 2025-05-07T20:23:34.1914334Z cache size : 512 KB 2025-05-07T20:23:34.1914541Z physical id : 0 2025-05-07T20:23:34.1914747Z siblings : 16 2025-05-07T20:23:34.1914947Z core id : 4 2025-05-07T20:23:34.1915137Z cpu cores : 8 2025-05-07T20:23:34.1915336Z apicid : 9 2025-05-07T20:23:34.1915536Z initial apicid : 9 2025-05-07T20:23:34.1915741Z fpu : yes 2025-05-07T20:23:34.1915941Z fpu_exception : yes 2025-05-07T20:23:34.1916160Z cpuid level : 13 2025-05-07T20:23:34.1916358Z wp : yes 2025-05-07T20:23:34.1918241Z flags : fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush mmx fxsr sse sse2 ht syscall nx mmxext fxsr_opt pdpe1gb rdtscp lm constant_tsc rep_good nopl nonstop_tsc cpuid extd_apicid aperfmperf tsc_known_freq pni pclmulqdq ssse3 fma cx16 sse4_1 sse4_2 movbe popcnt aes xsave avx f16c rdrand hypervisor lahf_lm cmp_legacy cr8_legacy abm sse4a misalignsse 3dnowprefetch topoext ssbd ibrs ibpb stibp vmmcall fsgsbase bmi1 avx2 smep bmi2 rdseed adx smap clflushopt clwb sha_ni xsaveopt xsavec xgetbv1 clzero xsaveerptr rdpru wbnoinvd arat npt nrip_save rdpid 
2025-05-07T20:23:34.1920523Z bugs : sysret_ss_attrs null_seg spectre_v1 spectre_v2 spec_store_bypass retbleed smt_rsb srso ibpb_no_ret 2025-05-07T20:23:34.1921001Z bogomips : 5599.85 2025-05-07T20:23:34.1921217Z TLB size : 3072 4K pages 2025-05-07T20:23:34.1921450Z clflush size : 64 2025-05-07T20:23:34.1921669Z cache_alignment : 64 2025-05-07T20:23:34.1922022Z address sizes : 48 bits physical, 48 bits virtual 2025-05-07T20:23:34.1922328Z power management: 2025-05-07T20:23:34.1922461Z 2025-05-07T20:23:34.1922546Z processor : 13 2025-05-07T20:23:34.1922770Z vendor_id : AuthenticAMD 2025-05-07T20:23:34.1923001Z cpu family : 23 2025-05-07T20:23:34.1923212Z model : 49 2025-05-07T20:23:34.1923418Z model name : AMD EPYC 7R32 2025-05-07T20:23:34.1923655Z stepping : 0 2025-05-07T20:23:34.1923891Z microcode : 0x830107f 2025-05-07T20:23:34.1924141Z cpu MHz : 3292.898 2025-05-07T20:23:34.1924359Z cache size : 512 KB 2025-05-07T20:23:34.1924565Z physical id : 0 2025-05-07T20:23:34.1924771Z siblings : 16 2025-05-07T20:23:34.1924972Z core id : 5 2025-05-07T20:23:34.1925167Z cpu cores : 8 2025-05-07T20:23:34.1925370Z apicid : 11 2025-05-07T20:23:34.1925578Z initial apicid : 11 2025-05-07T20:23:34.1925783Z fpu : yes 2025-05-07T20:23:34.1925980Z fpu_exception : yes 2025-05-07T20:23:34.1926196Z cpuid level : 13 2025-05-07T20:23:34.1926393Z wp : yes 2025-05-07T20:23:34.1928289Z flags : fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush mmx fxsr sse sse2 ht syscall nx mmxext fxsr_opt pdpe1gb rdtscp lm constant_tsc rep_good nopl nonstop_tsc cpuid extd_apicid aperfmperf tsc_known_freq pni pclmulqdq ssse3 fma cx16 sse4_1 sse4_2 movbe popcnt aes xsave avx f16c rdrand hypervisor lahf_lm cmp_legacy cr8_legacy abm sse4a misalignsse 3dnowprefetch topoext ssbd ibrs ibpb stibp vmmcall fsgsbase bmi1 avx2 smep bmi2 rdseed adx smap clflushopt clwb sha_ni xsaveopt xsavec xgetbv1 clzero xsaveerptr rdpru wbnoinvd arat npt nrip_save rdpid 2025-05-07T20:23:34.1930443Z bugs : sysret_ss_attrs null_seg spectre_v1 spectre_v2 spec_store_bypass retbleed smt_rsb srso ibpb_no_ret 2025-05-07T20:23:34.1930920Z bogomips : 5599.85 2025-05-07T20:23:34.1931132Z TLB size : 3072 4K pages 2025-05-07T20:23:34.1931365Z clflush size : 64 2025-05-07T20:23:34.1931581Z cache_alignment : 64 2025-05-07T20:23:34.1931840Z address sizes : 48 bits physical, 48 bits virtual 2025-05-07T20:23:34.1932152Z power management: 2025-05-07T20:23:34.1932290Z 2025-05-07T20:23:34.1932373Z processor : 14 2025-05-07T20:23:34.1932588Z vendor_id : AuthenticAMD 2025-05-07T20:23:34.1932814Z cpu family : 23 2025-05-07T20:23:34.1933020Z model : 49 2025-05-07T20:23:34.1933225Z model name : AMD EPYC 7R32 2025-05-07T20:23:34.1933457Z stepping : 0 2025-05-07T20:23:34.1933774Z microcode : 0x830107f 2025-05-07T20:23:34.1933998Z cpu MHz : 3299.445 2025-05-07T20:23:34.1934203Z cache size : 512 KB 2025-05-07T20:23:34.1934415Z physical id : 0 2025-05-07T20:23:34.1934618Z siblings : 16 2025-05-07T20:23:34.1934808Z core id : 6 2025-05-07T20:23:34.1935005Z cpu cores : 8 2025-05-07T20:23:34.1935203Z apicid : 13 2025-05-07T20:23:34.1935403Z initial apicid : 13 2025-05-07T20:23:34.1935614Z fpu : yes 2025-05-07T20:23:34.1935813Z fpu_exception : yes 2025-05-07T20:23:34.1936021Z cpuid level : 13 2025-05-07T20:23:34.1936225Z wp : yes 2025-05-07T20:23:34.1938115Z flags : fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush mmx fxsr sse sse2 ht syscall nx mmxext fxsr_opt pdpe1gb rdtscp lm constant_tsc rep_good nopl nonstop_tsc cpuid 
extd_apicid aperfmperf tsc_known_freq pni pclmulqdq ssse3 fma cx16 sse4_1 sse4_2 movbe popcnt aes xsave avx f16c rdrand hypervisor lahf_lm cmp_legacy cr8_legacy abm sse4a misalignsse 3dnowprefetch topoext ssbd ibrs ibpb stibp vmmcall fsgsbase bmi1 avx2 smep bmi2 rdseed adx smap clflushopt clwb sha_ni xsaveopt xsavec xgetbv1 clzero xsaveerptr rdpru wbnoinvd arat npt nrip_save rdpid 2025-05-07T20:23:34.1940379Z bugs : sysret_ss_attrs null_seg spectre_v1 spectre_v2 spec_store_bypass retbleed smt_rsb srso ibpb_no_ret 2025-05-07T20:23:34.1940859Z bogomips : 5599.85 2025-05-07T20:23:34.1941071Z TLB size : 3072 4K pages 2025-05-07T20:23:34.1941304Z clflush size : 64 2025-05-07T20:23:34.1941519Z cache_alignment : 64 2025-05-07T20:23:34.1941776Z address sizes : 48 bits physical, 48 bits virtual 2025-05-07T20:23:34.1942087Z power management: 2025-05-07T20:23:34.1942216Z 2025-05-07T20:23:34.1942390Z processor : 15 2025-05-07T20:23:34.1942600Z vendor_id : AuthenticAMD 2025-05-07T20:23:34.1942835Z cpu family : 23 2025-05-07T20:23:34.1943037Z model : 49 2025-05-07T20:23:34.1943245Z model name : AMD EPYC 7R32 2025-05-07T20:23:34.1943478Z stepping : 0 2025-05-07T20:23:34.1951115Z microcode : 0x830107f 2025-05-07T20:23:34.1951351Z cpu MHz : 3295.983 2025-05-07T20:23:34.1951568Z cache size : 512 KB 2025-05-07T20:23:34.1951782Z physical id : 0 2025-05-07T20:23:34.1951984Z siblings : 16 2025-05-07T20:23:34.1952183Z core id : 7 2025-05-07T20:23:34.1952383Z cpu cores : 8 2025-05-07T20:23:34.1952577Z apicid : 15 2025-05-07T20:23:34.1952778Z initial apicid : 15 2025-05-07T20:23:34.1952990Z fpu : yes 2025-05-07T20:23:34.1953183Z fpu_exception : yes 2025-05-07T20:23:34.1953398Z cpuid level : 13 2025-05-07T20:23:34.1953604Z wp : yes 2025-05-07T20:23:34.1955549Z flags : fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush mmx fxsr sse sse2 ht syscall nx mmxext fxsr_opt pdpe1gb rdtscp lm constant_tsc rep_good nopl nonstop_tsc cpuid extd_apicid aperfmperf tsc_known_freq pni pclmulqdq ssse3 fma cx16 sse4_1 sse4_2 movbe popcnt aes xsave avx f16c rdrand hypervisor lahf_lm cmp_legacy cr8_legacy abm sse4a misalignsse 3dnowprefetch topoext ssbd ibrs ibpb stibp vmmcall fsgsbase bmi1 avx2 smep bmi2 rdseed adx smap clflushopt clwb sha_ni xsaveopt xsavec xgetbv1 clzero xsaveerptr rdpru wbnoinvd arat npt nrip_save rdpid 2025-05-07T20:23:34.1957708Z bugs : sysret_ss_attrs null_seg spectre_v1 spectre_v2 spec_store_bypass retbleed smt_rsb srso ibpb_no_ret 2025-05-07T20:23:34.1958186Z bogomips : 5599.85 2025-05-07T20:23:34.1958409Z TLB size : 3072 4K pages 2025-05-07T20:23:34.1958645Z clflush size : 64 2025-05-07T20:23:34.1958854Z cache_alignment : 64 2025-05-07T20:23:34.1959122Z address sizes : 48 bits physical, 48 bits virtual 2025-05-07T20:23:34.1959433Z power management: 2025-05-07T20:23:34.1959559Z 2025-05-07T20:23:34.1959564Z 2025-05-07T20:23:34.1959682Z ################################################################################ 2025-05-07T20:23:34.1959988Z [INFO] Print PCI info ... 2025-05-07T20:23:34.1960231Z + lspci -v 2025-05-07T20:23:34.1960342Z 2025-05-07T20:23:34.1960561Z 00:00.0 Host bridge: Intel Corporation 440FX - 82441FX PMC [Natoma] 2025-05-07T20:23:34.1960934Z Subsystem: Amazon.com, Inc. 
Device 1237 2025-05-07T20:23:34.1961253Z Flags: bus master, medium devsel, latency 0 2025-05-07T20:23:34.1961460Z 2025-05-07T20:23:34.1961650Z 00:01.0 ISA bridge: Intel Corporation 82371SB PIIX3 ISA [Natoma/Triton II] 2025-05-07T20:23:34.1962020Z Physical Slot: 1 2025-05-07T20:23:34.1962251Z Flags: bus master, fast devsel, latency 0 2025-05-07T20:23:34.1962453Z 2025-05-07T20:23:34.1962695Z 00:01.3 Non-VGA unclassified device: Intel Corporation 82371AB/EB/MB PIIX4 ACPI (rev 08) 2025-05-07T20:23:34.1963117Z Physical Slot: 1 2025-05-07T20:23:34.1963363Z Flags: bus master, fast devsel, latency 0, IRQ 9 2025-05-07T20:23:34.1963586Z 2025-05-07T20:23:34.1963844Z 00:03.0 VGA compatible controller: Amazon.com, Inc. Device 1111 (prog-if 00 [VGA controller]) 2025-05-07T20:23:34.1964317Z Physical Slot: 3 2025-05-07T20:23:34.1964565Z Flags: bus master, fast devsel, latency 0 2025-05-07T20:23:34.1965010Z Memory at c1000000 (32-bit, prefetchable) [size=4M] 2025-05-07T20:23:34.1965359Z Expansion ROM at 000c0000 [disabled] [size=128K] 2025-05-07T20:23:34.1965575Z 2025-05-07T20:23:34.1965873Z 00:04.0 Non-Volatile memory controller: Amazon.com, Inc. NVMe EBS Controller (prog-if 02 [NVM Express]) 2025-05-07T20:23:34.1966366Z Subsystem: Amazon.com, Inc. Device 0000 2025-05-07T20:23:34.1966647Z Physical Slot: 4 2025-05-07T20:23:34.1966898Z Flags: bus master, fast devsel, latency 0, IRQ 11 2025-05-07T20:23:34.1967268Z Memory at c1808000 (32-bit, non-prefetchable) [size=16K] 2025-05-07T20:23:34.1967610Z Capabilities: 2025-05-07T20:23:34.1967884Z Kernel driver in use: nvme 2025-05-07T20:23:34.1968045Z 2025-05-07T20:23:34.1968348Z 00:05.0 Ethernet controller: Amazon.com, Inc. Elastic Network Adapter (ENA) 2025-05-07T20:23:34.1968809Z Subsystem: Amazon.com, Inc. Elastic Network Adapter (ENA) 2025-05-07T20:23:34.1969148Z Physical Slot: 5 2025-05-07T20:23:34.1969390Z Flags: bus master, fast devsel, latency 0 2025-05-07T20:23:34.1969740Z Memory at c1804000 (32-bit, non-prefetchable) [size=16K] 2025-05-07T20:23:34.1970113Z Memory at c1400000 (32-bit, prefetchable) [size=4M] 2025-05-07T20:23:34.1970436Z Capabilities: 2025-05-07T20:23:34.1970696Z Kernel driver in use: ena 2025-05-07T20:23:34.1970930Z Kernel modules: ena 2025-05-07T20:23:34.1971072Z 2025-05-07T20:23:34.1971236Z 00:1e.0 3D controller: NVIDIA Corporation GA102GL [A10G] (rev a1) 2025-05-07T20:23:34.1971605Z Subsystem: NVIDIA Corporation Device 152f 2025-05-07T20:23:34.1971887Z Physical Slot: 30 2025-05-07T20:23:34.1972144Z Flags: bus master, fast devsel, latency 0, IRQ 10 2025-05-07T20:23:34.1972509Z Memory at c0000000 (32-bit, non-prefetchable) [size=16M] 2025-05-07T20:23:34.1972883Z Memory at 1800000000 (64-bit, prefetchable) [size=32G] 2025-05-07T20:23:34.1973246Z Memory at 1040000000 (64-bit, prefetchable) [size=32M] 2025-05-07T20:23:34.1973570Z Capabilities: 2025-05-07T20:23:34.1973962Z Kernel driver in use: nvidia 2025-05-07T20:23:34.1974241Z Kernel modules: nvidia 2025-05-07T20:23:34.1974388Z 2025-05-07T20:23:34.1974679Z 00:1f.0 Non-Volatile memory controller: Amazon.com, Inc. NVMe SSD Controller (prog-if 02 [NVM Express]) 2025-05-07T20:23:34.1975177Z Subsystem: Amazon.com, Inc. 
Device 0000 2025-05-07T20:23:34.1975454Z Physical Slot: 31 2025-05-07T20:23:34.1975695Z Flags: bus master, fast devsel, latency 0 2025-05-07T20:23:34.1976041Z Memory at c1800000 (32-bit, non-prefetchable) [size=16K] 2025-05-07T20:23:34.1976410Z Memory at c180c000 (32-bit, prefetchable) [size=8K] 2025-05-07T20:23:34.1976735Z Capabilities: 2025-05-07T20:23:34.1976992Z Kernel driver in use: nvme 2025-05-07T20:23:34.1977147Z 2025-05-07T20:23:34.1977151Z 2025-05-07T20:23:34.1977273Z ################################################################################ 2025-05-07T20:23:34.1977589Z [INFO] Print Linux distribution info ... 2025-05-07T20:23:34.1977868Z + uname -a 2025-05-07T20:23:34.1977983Z 2025-05-07T20:23:34.1978383Z Linux ip-10-0-14-174.ec2.internal 6.1.130-139.222.amzn2023.x86_64 #1 SMP PREEMPT_DYNAMIC Tue Mar 11 01:10:58 UTC 2025 x86_64 x86_64 x86_64 GNU/Linux 2025-05-07T20:23:34.1978864Z 2025-05-07T20:23:34.1978945Z + uname -m 2025-05-07T20:23:34.1979059Z 2025-05-07T20:23:34.1979132Z x86_64 2025-05-07T20:23:34.1979239Z 2025-05-07T20:23:34.1979325Z + cat /proc/version 2025-05-07T20:23:34.1979452Z 2025-05-07T20:23:34.1979978Z Linux version 6.1.130-139.222.amzn2023.x86_64 (mockbuild@ip-10-0-55-76) (gcc (GCC) 11.5.0 20240719 (Red Hat 11.5.0-5), GNU ld version 2.39-6.amzn2023.0.11) #1 SMP PREEMPT_DYNAMIC Tue Mar 11 01:10:58 UTC 2025 2025-05-07T20:23:34.1980580Z 2025-05-07T20:23:34.1980669Z + cat /etc/os-release 2025-05-07T20:23:34.1980817Z 2025-05-07T20:23:34.1980907Z NAME="Amazon Linux" 2025-05-07T20:23:34.1981112Z VERSION="2023" 2025-05-07T20:23:34.1981312Z ID="amzn" 2025-05-07T20:23:34.1981493Z ID_LIKE="fedora" 2025-05-07T20:23:34.1981701Z VERSION_ID="2023" 2025-05-07T20:23:34.1982021Z PLATFORM_ID="platform:al2023" 2025-05-07T20:23:34.1982291Z PRETTY_NAME="Amazon Linux 2023.6.20250317" 2025-05-07T20:23:34.1982567Z ANSI_COLOR="0;33" 2025-05-07T20:23:34.1982812Z CPE_NAME="cpe:2.3:o:amazon:amazon_linux:2023" 2025-05-07T20:23:34.1983191Z HOME_URL="https://aws.amazon.com/linux/amazon-linux-2023/" 2025-05-07T20:23:34.1983611Z DOCUMENTATION_URL="https://docs.aws.amazon.com/linux/" 2025-05-07T20:23:34.1984017Z SUPPORT_URL="https://aws.amazon.com/premiumsupport/" 2025-05-07T20:23:34.1984473Z BUG_REPORT_URL="https://github.com/amazonlinux/amazon-linux-2023" 2025-05-07T20:23:34.1984856Z VENDOR_NAME="AWS" 2025-05-07T20:23:34.1985092Z VENDOR_URL="https://aws.amazon.com/" 2025-05-07T20:23:34.1985376Z SUPPORT_END="2029-06-30" 2025-05-07T20:23:34.1985525Z 2025-05-07T20:23:34.1985726Z ################################################################################ 2025-05-07T20:23:34.1986029Z # Print EC2 Instance Info 2025-05-07T20:23:34.1986263Z # 2025-05-07T20:23:34.1986475Z # [2025-05-07T20:23:34.197Z] + print_ec2_info 2025-05-07T20:23:34.1986793Z ################################################################################ 2025-05-07T20:23:34.1987000Z 2025-05-07T20:23:34.2096670Z ami-id: ami-071226ecf16aa7d96 2025-05-07T20:23:34.2212506Z instance-id: i-0b68a33264ad7b273 2025-05-07T20:23:34.2335605Z instance-type: g5.4xlarge 2025-05-07T20:23:34.2373358Z ##[group]Run . $PRELUDE; print_gpu_info 2025-05-07T20:23:34.2373858Z . 
$PRELUDE; print_gpu_info 2025-05-07T20:23:34.2383649Z shell: /usr/bin/bash --noprofile --norc -e -o pipefail {0} 2025-05-07T20:23:34.2384049Z env: 2025-05-07T20:23:34.2384273Z PRELUDE: .github/scripts/setup_env.bash 2025-05-07T20:23:34.2384572Z BUILD_ENV: build_binary 2025-05-07T20:23:34.2384815Z BUILD_TARGET: genai 2025-05-07T20:23:34.2385044Z BUILD_VARIANT: cuda 2025-05-07T20:23:34.2385281Z BUILD_CUDA_VERSION: 12.6.3 2025-05-07T20:23:34.2385535Z ENFORCE_CUDA_DEVICE: 1 2025-05-07T20:23:34.2385833Z GPU_FLAG: --gpus all -e NVIDIA_DRIVER_CAPABILITIES=all 2025-05-07T20:23:34.2386165Z ##[endgroup] 2025-05-07T20:23:34.5753219Z ################################################################################ 2025-05-07T20:23:34.5753619Z [INFO] Printing general display info ... 2025-05-07T20:23:34.5767585Z [EXEC] [ATTEMPT 0/3] + wget -q --timeout 1 pypi.org -O /dev/null 2025-05-07T20:23:34.6662871Z [CHECK] Network does not appear to be blocked. 2025-05-07T20:23:34.6671576Z /usr/bin/sudo 2025-05-07T20:23:34.6682526Z which: no apt-get in (/usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin) 2025-05-07T20:23:34.6693505Z /usr/bin/yum 2025-05-07T20:23:34.6695319Z [INSTALL] Updating system repositories ... 2025-05-07T20:23:34.6716868Z [EXEC] [ATTEMPT 0/3] + sudo yum update -y 2025-05-07T20:23:35.0822361Z Last metadata expiration check: 0:00:07 ago on Wed May 7 20:23:28 2025. 2025-05-07T20:23:35.1731342Z ================================================================================ 2025-05-07T20:23:35.1731803Z WARNING: 2025-05-07T20:23:35.1732075Z A newer release of "Amazon Linux" is available. 2025-05-07T20:23:35.1732311Z 2025-05-07T20:23:35.1732402Z Available Versions: 2025-05-07T20:23:35.1732549Z 2025-05-07T20:23:35.1732646Z Version 2023.7.20250331: 2025-05-07T20:23:35.1732948Z Run the following command to upgrade to 2023.7.20250331: 2025-05-07T20:23:35.1733207Z 2025-05-07T20:23:35.1733341Z dnf upgrade --releasever=2023.7.20250331 2025-05-07T20:23:35.1733554Z 2025-05-07T20:23:35.1733727Z Release notes: 2025-05-07T20:23:35.1734127Z https://docs.aws.amazon.com/linux/al2023/release-notes/relnotes-2023.7.20250331.html 2025-05-07T20:23:35.1734489Z 2025-05-07T20:23:35.1734598Z Version 2023.7.20250414: 2025-05-07T20:23:35.1734896Z Run the following command to upgrade to 2023.7.20250414: 2025-05-07T20:23:35.1735146Z 2025-05-07T20:23:35.1735261Z dnf upgrade --releasever=2023.7.20250414 2025-05-07T20:23:35.1735466Z 2025-05-07T20:23:35.1735559Z Release notes: 2025-05-07T20:23:35.1735940Z https://docs.aws.amazon.com/linux/al2023/release-notes/relnotes-2023.7.20250414.html 2025-05-07T20:23:35.1736585Z 2025-05-07T20:23:35.1736673Z Version 2023.7.20250428: 2025-05-07T20:23:35.1736974Z Run the following command to upgrade to 2023.7.20250428: 2025-05-07T20:23:35.1737214Z 2025-05-07T20:23:35.1737330Z dnf upgrade --releasever=2023.7.20250428 2025-05-07T20:23:35.1737538Z 2025-05-07T20:23:35.1737622Z Release notes: 2025-05-07T20:23:35.1738003Z https://docs.aws.amazon.com/linux/al2023/release-notes/relnotes-2023.7.20250428.html 2025-05-07T20:23:35.1738355Z 2025-05-07T20:23:35.1738471Z ================================================================================ 2025-05-07T20:23:35.2892018Z Dependencies resolved. 
2025-05-07T20:23:35.3175140Z ================================================================================ 2025-05-07T20:23:35.3175948Z Package Arch Version Repository Size 2025-05-07T20:23:35.3176706Z ================================================================================ 2025-05-07T20:23:35.3177313Z Upgrading: 2025-05-07T20:23:35.3178010Z nvidia-container-toolkit x86_64 1.17.6-1 nvidia-container-toolkit 1.2 M 2025-05-07T20:23:35.3179152Z nvidia-container-toolkit-base x86_64 1.17.6-1 nvidia-container-toolkit 5.7 M 2025-05-07T20:23:35.3179839Z 2025-05-07T20:23:35.3180434Z Transaction Summary 2025-05-07T20:23:35.3180936Z ================================================================================ 2025-05-07T20:23:35.3181535Z Upgrade 2 Packages 2025-05-07T20:23:35.3181799Z 2025-05-07T20:23:35.3182011Z Total download size: 6.9 M 2025-05-07T20:23:35.3182504Z Downloading Packages: 2025-05-07T20:23:35.3566222Z (1/2): nvidia-container-toolkit-1.17.6-1.x86_64 33 MB/s | 1.2 MB 00:00 2025-05-07T20:23:35.4234277Z (2/2): nvidia-container-toolkit-base-1.17.6-1.x 54 MB/s | 5.7 MB 00:00 2025-05-07T20:23:35.4245265Z -------------------------------------------------------------------------------- 2025-05-07T20:23:35.4246432Z Total 65 MB/s | 6.9 MB 00:00 2025-05-07T20:23:35.4249693Z Running transaction check 2025-05-07T20:23:35.4355582Z Transaction check succeeded. 2025-05-07T20:23:35.4356022Z Running transaction test 2025-05-07T20:23:35.4651640Z Transaction test succeeded. 2025-05-07T20:23:35.4654518Z Running transaction 2025-05-07T20:23:36.0205109Z Preparing : 1/1 2025-05-07T20:23:36.1257857Z Upgrading : nvidia-container-toolkit-base-1.17.6-1.x86_64 1/4 2025-05-07T20:23:36.1281633Z Upgrading : nvidia-container-toolkit-1.17.6-1.x86_64 2/4 2025-05-07T20:23:36.1495874Z Running scriptlet: nvidia-container-toolkit-1.17.6-1.x86_64 2/4 2025-05-07T20:23:36.1496605Z Cleanup : nvidia-container-toolkit-1.16.2-1.x86_64 3/4 2025-05-07T20:23:36.1603832Z Running scriptlet: nvidia-container-toolkit-1.16.2-1.x86_64 3/4 2025-05-07T20:23:36.1630739Z Cleanup : nvidia-container-toolkit-base-1.16.2-1.x86_64 4/4 2025-05-07T20:23:36.3216991Z Running scriptlet: nvidia-container-toolkit-1.17.6-1.x86_64 4/4 2025-05-07T20:23:36.3217550Z Verifying : nvidia-container-toolkit-1.17.6-1.x86_64 1/4 2025-05-07T20:23:36.3218098Z Verifying : nvidia-container-toolkit-1.16.2-1.x86_64 2/4 2025-05-07T20:23:36.3218626Z Verifying : nvidia-container-toolkit-base-1.17.6-1.x86_64 3/4 2025-05-07T20:23:36.4607830Z ================================================================================ 2025-05-07T20:23:36.4608202Z WARNING: 2025-05-07T20:23:36.4608455Z A newer release of "Amazon Linux" is available. 
2025-05-07T20:23:36.4608682Z 2025-05-07T20:23:36.4608783Z Available Versions: 2025-05-07T20:23:36.4608929Z 2025-05-07T20:23:36.4609035Z Version 2023.7.20250331: 2025-05-07T20:23:36.4609341Z Run the following command to upgrade to 2023.7.20250331: 2025-05-07T20:23:36.4609906Z 2025-05-07T20:23:36.4610034Z dnf upgrade --releasever=2023.7.20250331 2025-05-07T20:23:36.4610241Z 2025-05-07T20:23:36.4610331Z Release notes: 2025-05-07T20:23:36.4610734Z https://docs.aws.amazon.com/linux/al2023/release-notes/relnotes-2023.7.20250331.html 2025-05-07T20:23:36.4611098Z 2025-05-07T20:23:36.4611205Z Version 2023.7.20250414: 2025-05-07T20:23:36.4611504Z Run the following command to upgrade to 2023.7.20250414: 2025-05-07T20:23:36.4611747Z 2025-05-07T20:23:36.4611871Z dnf upgrade --releasever=2023.7.20250414 2025-05-07T20:23:36.4612076Z 2025-05-07T20:23:36.4612162Z Release notes: 2025-05-07T20:23:36.4612549Z https://docs.aws.amazon.com/linux/al2023/release-notes/relnotes-2023.7.20250414.html 2025-05-07T20:23:36.4612902Z 2025-05-07T20:23:36.4613000Z Version 2023.7.20250428: 2025-05-07T20:23:36.4613295Z Run the following command to upgrade to 2023.7.20250428: 2025-05-07T20:23:36.4613546Z 2025-05-07T20:23:36.4613804Z dnf upgrade --releasever=2023.7.20250428 2025-05-07T20:23:36.4614017Z 2025-05-07T20:23:36.4614104Z Release notes: 2025-05-07T20:23:36.4614489Z https://docs.aws.amazon.com/linux/al2023/release-notes/relnotes-2023.7.20250428.html 2025-05-07T20:23:36.4614839Z 2025-05-07T20:23:36.4615192Z ================================================================================ 2025-05-07T20:23:36.5173541Z Verifying : nvidia-container-toolkit-base-1.16.2-1.x86_64 4/4 2025-05-07T20:23:36.5174202Z 2025-05-07T20:23:36.5174326Z Upgraded: 2025-05-07T20:23:36.5174862Z nvidia-container-toolkit-1.17.6-1.x86_64 2025-05-07T20:23:36.5175875Z nvidia-container-toolkit-base-1.17.6-1.x86_64 2025-05-07T20:23:36.5176490Z 2025-05-07T20:23:36.5176604Z Complete! 2025-05-07T20:23:36.5618537Z [INSTALL] Installing system package(s): hostname lshw ... 2025-05-07T20:23:36.5641850Z [EXEC] [ATTEMPT 0/3] + sudo yum install -y hostname lshw 2025-05-07T20:23:37.0614577Z Last metadata expiration check: 0:00:09 ago on Wed May 7 20:23:28 2025. 2025-05-07T20:23:37.0853278Z Package hostname-3.23-4.amzn2023.0.3.x86_64 is already installed. 2025-05-07T20:23:37.0858299Z Package lshw-B.02.19.2-7.amzn2023.0.3.x86_64 is already installed. 2025-05-07T20:23:37.1254597Z Dependencies resolved. 2025-05-07T20:23:37.1437684Z Nothing to do. 2025-05-07T20:23:37.1438449Z Complete! 2025-05-07T20:23:37.1830519Z + hostname 2025-05-07T20:23:37.1830672Z 2025-05-07T20:23:37.1845288Z ip-10-0-14-174.ec2.internal 2025-05-07T20:23:37.1846856Z 2025-05-07T20:23:37.1847210Z + sudo lshw -C display 2025-05-07T20:23:37.1847374Z 2025-05-07T20:23:37.4575501Z *-display:0 UNCLAIMED 2025-05-07T20:23:37.4575836Z description: VGA compatible controller 2025-05-07T20:23:37.4576160Z product: Amazon.com, Inc. 2025-05-07T20:23:37.4576439Z vendor: Amazon.com, Inc. 
2025-05-07T20:23:37.4576697Z physical id: 3 2025-05-07T20:23:37.4576931Z bus info: pci@0000:00:03.0 2025-05-07T20:23:37.4577219Z version: 00 2025-05-07T20:23:37.4577435Z width: 32 bits 2025-05-07T20:23:37.4577653Z clock: 33MHz 2025-05-07T20:23:37.4577904Z capabilities: vga_controller bus_master 2025-05-07T20:23:37.4578220Z configuration: latency=0 2025-05-07T20:23:37.4578555Z resources: memory:c1000000-c13fffff memory:c0000-dffff 2025-05-07T20:23:37.4578881Z *-display:1 2025-05-07T20:23:37.4579105Z description: 3D controller 2025-05-07T20:23:37.4579387Z product: GA102GL [A10G] 2025-05-07T20:23:37.4579647Z vendor: NVIDIA Corporation 2025-05-07T20:23:37.4579916Z physical id: 1e 2025-05-07T20:23:37.4580155Z bus info: pci@0000:00:1e.0 2025-05-07T20:23:37.4580405Z version: a1 2025-05-07T20:23:37.4580620Z width: 64 bits 2025-05-07T20:23:37.4580839Z clock: 33MHz 2025-05-07T20:23:37.4581118Z capabilities: pm pciexpress msix bus_master cap_list 2025-05-07T20:23:37.4581487Z configuration: driver=nvidia latency=0 2025-05-07T20:23:37.4582408Z resources: iomemory:180-17f iomemory:100-ff irq:10 memory:c0000000-c0ffffff memory:1800000000-1fffffffff memory:1040000000-1041ffffff 2025-05-07T20:23:37.4616201Z 2025-05-07T20:23:37.4616565Z ################################################################################ 2025-05-07T20:23:37.4616928Z [INFO] Printing NVIDIA GPU info ... 2025-05-07T20:23:37.4744061Z 00:1e.0 3D controller: NVIDIA Corporation GA102GL [A10G] (rev a1) 2025-05-07T20:23:37.4927840Z Wed May 7 20:23:37 2025 2025-05-07T20:23:37.4928193Z +-----------------------------------------------------------------------------------------+ 2025-05-07T20:23:37.4928697Z | NVIDIA-SMI 570.133.07 Driver Version: 570.133.07 CUDA Version: 12.8 | 2025-05-07T20:23:37.4929170Z |-----------------------------------------+------------------------+----------------------+ 2025-05-07T20:23:37.4929643Z | GPU Name Persistence-M | Bus-Id Disp.A | Volatile Uncorr. ECC | 2025-05-07T20:23:37.4930174Z | Fan Temp Perf Pwr:Usage/Cap | Memory-Usage | GPU-Util Compute M. | 2025-05-07T20:23:37.4930589Z | | | MIG M. | 2025-05-07T20:23:37.4931205Z |=========================================+========================+======================| 2025-05-07T20:23:37.5062247Z | 0 NVIDIA A10G On | 00000000:00:1E.0 Off | 0 | 2025-05-07T20:23:37.5062686Z | 0% 29C P8 24W / 300W | 0MiB / 23028MiB | 0% Default | 2025-05-07T20:23:37.5063080Z | | | N/A | 2025-05-07T20:23:37.5063459Z +-----------------------------------------+------------------------+----------------------+ 2025-05-07T20:23:37.5067439Z 2025-05-07T20:23:37.5068217Z +-----------------------------------------------------------------------------------------+ 2025-05-07T20:23:37.5069057Z | Processes: | 2025-05-07T20:23:37.5069894Z | GPU GI CI PID Type Process name GPU Memory | 2025-05-07T20:23:37.5070707Z | ID ID Usage | 2025-05-07T20:23:37.5071370Z |=========================================================================================| 2025-05-07T20:23:37.5072203Z | No running processes found | 2025-05-07T20:23:37.5073098Z +-----------------------------------------------------------------------------------------+ 2025-05-07T20:23:37.7710964Z ################################################################################ 2025-05-07T20:23:37.7711295Z [INFO] Printing AMD GPU info ... 
2025-05-07T20:23:37.7865776Z which: no rocminfo in (/usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin) 2025-05-07T20:23:37.7866233Z [CHECK] rocminfo not found 2025-05-07T20:23:37.7866932Z which: no rocm-smi in (/usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin) 2025-05-07T20:23:37.7868734Z [CHECK] rocm-smi not found 2025-05-07T20:23:37.7905250Z ##[group]Run . $PRELUDE; setup_miniconda $HOME/miniconda 2025-05-07T20:23:37.7905672Z . $PRELUDE; setup_miniconda $HOME/miniconda 2025-05-07T20:23:37.7917729Z shell: /usr/bin/bash --noprofile --norc -e -o pipefail {0} 2025-05-07T20:23:37.7918074Z env: 2025-05-07T20:23:37.7918298Z PRELUDE: .github/scripts/setup_env.bash 2025-05-07T20:23:37.7918587Z BUILD_ENV: build_binary 2025-05-07T20:23:37.7918829Z BUILD_TARGET: genai 2025-05-07T20:23:37.7919058Z BUILD_VARIANT: cuda 2025-05-07T20:23:37.7919284Z BUILD_CUDA_VERSION: 12.6.3 2025-05-07T20:23:37.7919536Z ENFORCE_CUDA_DEVICE: 1 2025-05-07T20:23:37.7919835Z GPU_FLAG: --gpus all -e NVIDIA_DRIVER_CAPABILITIES=all 2025-05-07T20:23:37.7920155Z ##[endgroup] 2025-05-07T20:23:38.1282392Z ################################################################################ 2025-05-07T20:23:38.1282761Z # Setup Miniconda 2025-05-07T20:23:38.1282971Z # 2025-05-07T20:23:38.1297625Z # [2025-05-07T20:23:38.129Z] + setup_miniconda /home/ec2-user/miniconda 2025-05-07T20:23:38.1298036Z ################################################################################ 2025-05-07T20:23:38.1298519Z 2025-05-07T20:23:38.1314129Z [EXEC] [ATTEMPT 0/3] + wget -q --timeout 1 pypi.org -O /dev/null 2025-05-07T20:23:38.2223328Z [CHECK] Network does not appear to be blocked. 2025-05-07T20:23:38.2223837Z [SETUP] A Miniconda installation appears to already exist in /home/ec2-user/miniconda ... 2025-05-07T20:23:38.2224377Z [SETUP] Clearing out directory: /home/ec2-user/miniconda ... 2025-05-07T20:23:38.2224739Z + rm -rf /home/ec2-user/miniconda 2025-05-07T20:23:38.2224927Z 2025-05-07T20:23:43.2506857Z 2025-05-07T20:23:43.2507454Z + mkdir -p /home/ec2-user/miniconda 2025-05-07T20:23:43.2507718Z 2025-05-07T20:23:43.2523485Z 2025-05-07T20:23:43.2523806Z [SETUP] Downloading the Miniconda installer ... 2025-05-07T20:23:43.2546619Z [EXEC] [ATTEMPT 0/3] + wget -q https://repo.anaconda.com/miniconda/Miniconda3-latest-Linux-x86_64.sh -O miniconda.sh 2025-05-07T20:23:44.2614384Z [SETUP] Installing Miniconda ... 2025-05-07T20:23:44.2614764Z + bash miniconda.sh -b -p /home/ec2-user/miniconda -u 2025-05-07T20:23:44.2615011Z 2025-05-07T20:23:44.2761383Z PREFIX=/home/ec2-user/miniconda 2025-05-07T20:23:44.7255795Z Unpacking payload ... 2025-05-07T20:23:45.2427737Z entry_point.py:256: DeprecationWarning: Python 3.14 will, by default, filter extracted tar archives and reject files or modify their metadata. Use the filter argument to control this behavior. 2025-05-07T20:23:46.0462660Z entry_point.py:256: DeprecationWarning: Python 3.14 will, by default, filter extracted tar archives and reject files or modify their metadata. Use the filter argument to control this behavior. 2025-05-07T20:23:48.1745781Z 2025-05-07T20:23:48.1746161Z Installing base environment... 2025-05-07T20:23:48.1746397Z 2025-05-07T20:23:49.2539381Z Preparing transaction: ...working... done 2025-05-07T20:23:52.2639236Z Executing transaction: ...working... done 2025-05-07T20:23:52.9206845Z entry_point.py:256: DeprecationWarning: Python 3.14 will, by default, filter extracted tar archives and reject files or modify their metadata. Use the filter argument to control this behavior. 
2025-05-07T20:23:53.0102485Z installation finished.
2025-05-07T20:23:53.0109994Z + rm -f miniconda.sh
2025-05-07T20:23:53.0426163Z [SETUP] Reloading the bash configuration ...
2025-05-07T20:23:53.0426510Z + /home/ec2-user/miniconda/bin/conda init bash
2025-05-07T20:23:53.4143991Z no change /home/ec2-user/miniconda/condabin/conda
2025-05-07T20:23:53.4144484Z no change /home/ec2-user/miniconda/bin/conda
2025-05-07T20:23:53.4144833Z no change /home/ec2-user/miniconda/bin/conda-env
2025-05-07T20:23:53.4145213Z no change /home/ec2-user/miniconda/bin/activate
2025-05-07T20:23:53.4145565Z no change /home/ec2-user/miniconda/bin/deactivate
2025-05-07T20:23:53.4146365Z no change /home/ec2-user/miniconda/etc/profile.d/conda.sh
2025-05-07T20:23:53.4146797Z no change /home/ec2-user/miniconda/etc/fish/conf.d/conda.fish
2025-05-07T20:23:53.4147228Z no change /home/ec2-user/miniconda/shell/condabin/Conda.psm1
2025-05-07T20:23:53.4147677Z no change /home/ec2-user/miniconda/shell/condabin/conda-hook.ps1
2025-05-07T20:23:53.4148202Z no change /home/ec2-user/miniconda/lib/python3.13/site-packages/xontrib/conda.xsh
2025-05-07T20:23:53.4148714Z no change /home/ec2-user/miniconda/etc/profile.d/conda.csh
2025-05-07T20:23:53.4149067Z no change /home/ec2-user/.bashrc
2025-05-07T20:23:53.4149329Z No action taken.
2025-05-07T20:23:53.4806295Z + . /home/ec2-user/.bashrc
2025-05-07T20:23:54.3260480Z [SETUP] Installing libmamba-solver (required since Anaconda 2024.02-1) and libarchive ...
2025-05-07T20:23:54.3284535Z [EXEC] [ATTEMPT 0/3] + conda install --solver=classic -c conda-forge --override-channels -y conda-libmamba-solver libmamba libmambapy libarchive
2025-05-07T20:24:07.7679165Z Collecting package metadata (current_repodata.json): done
2025-05-07T20:24:09.3455567Z Solving environment: done
2025-05-07T20:24:09.4427513Z ## Package Plan ##
2025-05-07T20:24:09.4427905Z environment location: /home/ec2-user/miniconda
2025-05-07T20:24:09.4428242Z added / updated specs:
2025-05-07T20:24:09.4428502Z - conda-libmamba-solver
2025-05-07T20:24:09.4428763Z - libarchive
2025-05-07T20:24:09.4428969Z - libmamba
2025-05-07T20:24:09.4429171Z - libmambapy
2025-05-07T20:24:09.4429418Z The following packages will be downloaded:
2025-05-07T20:24:09.4429744Z package | build
2025-05-07T20:24:09.4430062Z ---------------------------|-----------------
2025-05-07T20:24:09.4430466Z ca-certificates-2025.4.26 | hbd8a1cb_0 149 KB conda-forge
2025-05-07T20:24:09.4430926Z certifi-2025.4.26 | pyhd8ed1ab_0 154 KB conda-forge
2025-05-07T20:24:09.4431348Z conda-25.3.1 | py313h78bf25f_1 1.1 MB conda-forge
2025-05-07T20:24:09.4431817Z conda-libmamba-solver-25.4.0| pyhd8ed1ab_0 41 KB conda-forge
2025-05-07T20:24:09.4432257Z ------------------------------------------------------------
2025-05-07T20:24:09.4432600Z Total: 1.4 MB
2025-05-07T20:24:09.4432921Z The following packages will be UPDATED:
2025-05-07T20:24:09.4436557Z ca-certificates pkgs/main/linux-64::ca-certificates-2~ --> conda-forge/noarch::ca-certificates-2025.4.26-hbd8a1cb_0
2025-05-07T20:24:09.4437332Z conda pkgs/main::conda-25.3.1-py313h06a4308~ --> conda-forge::conda-25.3.1-py313h78bf25f_1
2025-05-07T20:24:09.4437925Z The following packages will be SUPERSEDED by a higher-priority channel:
2025-05-07T20:24:09.4438543Z certifi pkgs/main/linux-64::certifi-2025.4.26~ --> conda-forge/noarch::certifi-2025.4.26-pyhd8ed1ab_0
2025-05-07T20:24:09.4439330Z conda-libmamba-so~ pkgs/main::conda-libmamba-solver-25.4~ --> conda-forge::conda-libmamba-solver-25.4.0-pyhd8ed1ab_0
2025-05-07T20:24:09.4440201Z Downloading and Extracting Packages: ...working...
2025-05-07T20:24:09.5276451Z conda-libmamba-solve | 41 KB | ########## | 100%
2025-05-07T20:24:09.5410543Z ca-certificates-2025 | 149 KB | ########## | 100%
2025-05-07T20:24:09.5830393Z certifi-2025.4.26 | 154 KB | ########## | 100%
2025-05-07T20:24:09.6910662Z conda-25.3.1 | 1.1 MB | ########## | 100%
2025-05-07T20:24:09.6912526Z done
2025-05-07T20:24:09.7915329Z Preparing transaction: done
2025-05-07T20:24:09.8918284Z Verifying transaction: done
2025-05-07T20:24:11.2940424Z Executing transaction: done
2025-05-07T20:24:13.2107222Z [SETUP] Updating Miniconda base packages ...
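[NOTE] The Miniconda bootstrap logged above follows the standard unattended-install pattern; a condensed sketch using only the commands the log itself shows (-b runs the installer without prompts, -p sets the install prefix, -u allows updating an existing prefix):

    PREFIX="$HOME/miniconda"
    rm -rf "$PREFIX" && mkdir -p "$PREFIX"
    wget -q https://repo.anaconda.com/miniconda/Miniconda3-latest-Linux-x86_64.sh -O miniconda.sh
    bash miniconda.sh -b -p "$PREFIX" -u
    rm -f miniconda.sh
    "$PREFIX/bin/conda" init bash && . "$HOME/.bashrc"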
2025-05-07T20:24:13.2132528Z [EXEC] [ATTEMPT 0/3] + conda update -n base -c defaults --update-deps -y conda
2025-05-07T20:24:14.0421571Z Channels:
2025-05-07T20:24:14.0421893Z - defaults
2025-05-07T20:24:14.0422166Z Platform: linux-64
2025-05-07T20:24:15.2828299Z Collecting package metadata (repodata.json): done
2025-05-07T20:24:15.4054267Z Solving environment: done
2025-05-07T20:24:15.4055080Z Channels:
2025-05-07T20:24:15.4055388Z - defaults
2025-05-07T20:24:15.4055388Z Platform: linux-64
2025-05-07T20:24:15.7003025Z Collecting package metadata (repodata.json): done
2025-05-07T20:24:15.9160214Z Solving environment: done
2025-05-07T20:24:16.0587756Z ## Package Plan ##
2025-05-07T20:24:16.0588070Z environment location: /home/ec2-user/miniconda
2025-05-07T20:24:16.0588395Z added / updated specs:
2025-05-07T20:24:16.0588631Z - conda
2025-05-07T20:24:16.0588894Z The following packages will be downloaded:
2025-05-07T20:24:16.0589213Z package | build
2025-05-07T20:24:16.0589767Z ---------------------------|-----------------
2025-05-07T20:24:16.0590117Z pip-25.1 | pyhc872135_2 1.3 MB
2025-05-07T20:24:16.0590502Z tzdata-2025b | h04d1e81_0 116 KB
2025-05-07T20:24:16.0590866Z ------------------------------------------------------------
2025-05-07T20:24:16.0591201Z Total: 1.4 MB
2025-05-07T20:24:16.0591522Z The following packages will be UPDATED:
2025-05-07T20:24:16.0592017Z pip pkgs/main/linux-64::pip-25.0-py313h06~ --> pkgs/main/noarch::pip-25.1-pyhc872135_2
2025-05-07T20:24:16.0592510Z tzdata 2025a-h04d1e81_0 --> 2025b-h04d1e81_0
2025-05-07T20:24:16.0593067Z Downloading and Extracting Packages: ...working...
2025-05-07T20:24:16.3212646Z tzdata-2025b | 116 KB | ########## | 100%
2025-05-07T20:24:16.3217190Z pip-25.1 | 1.3 MB | ########## | 100%
2025-05-07T20:24:16.3218196Z done
2025-05-07T20:24:16.4220799Z Preparing transaction: done
2025-05-07T20:24:16.5226201Z Verifying transaction: done
2025-05-07T20:24:18.6252314Z Executing transaction: done
2025-05-07T20:24:19.2623347Z [SETUP] Cleaning up Conda packages ...
2025-05-07T20:24:19.2627791Z + conda clean --packages --tarball -y
2025-05-07T20:24:20.2691291Z Will remove 99 (117.8 MB) tarball(s).
2025-05-07T20:24:20.2691692Z Will remove 11 (16.0 MB) package(s).
2025-05-07T20:24:20.3352125Z + conda clean --all -y
2025-05-07T20:24:20.8870262Z There are no unused tarball(s) to remove.
2025-05-07T20:24:20.8870672Z Will remove 1 index cache(s).
2025-05-07T20:24:20.8870964Z There are no unused package(s) to remove.
2025-05-07T20:24:20.8871317Z There are no tempfile(s) to remove. 2025-05-07T20:24:20.8871616Z There are no logfile(s) to remove. 2025-05-07T20:24:20.9520178Z 2025-05-07T20:24:20.9524736Z + conda info 2025-05-07T20:24:20.9525180Z 2025-05-07T20:24:21.7073130Z 2025-05-07T20:24:21.7073794Z active environment : base 2025-05-07T20:24:21.7074267Z active env location : /home/ec2-user/miniconda 2025-05-07T20:24:21.7074696Z shell level : 1 2025-05-07T20:24:21.7075046Z user config file : /home/ec2-user/.condarc 2025-05-07T20:24:21.7075558Z populated config files : /home/ec2-user/miniconda/.condarc 2025-05-07T20:24:21.7076040Z conda version : 25.3.1 2025-05-07T20:24:21.7076366Z conda-build version : not installed 2025-05-07T20:24:21.7076676Z python version : 3.13.2.final.0 2025-05-07T20:24:21.7077090Z solver : libmamba (default) 2025-05-07T20:24:21.7077497Z virtual packages : __archspec=1=zen2 2025-05-07T20:24:21.7077900Z __conda=25.3.1=0 2025-05-07T20:24:21.7078284Z __cuda=12.8=0 2025-05-07T20:24:21.7078769Z __glibc=2.34=0 2025-05-07T20:24:21.7079043Z __linux=6.1.130=0 2025-05-07T20:24:21.7079672Z __unix=0=0 2025-05-07T20:24:21.7080010Z base environment : /home/ec2-user/miniconda (writable) 2025-05-07T20:24:21.7080404Z conda av data dir : /home/ec2-user/miniconda/etc/conda 2025-05-07T20:24:21.7080754Z conda av metadata url : None 2025-05-07T20:24:21.7081124Z channel URLs : https://repo.anaconda.com/pkgs/main/linux-64 2025-05-07T20:24:21.7081561Z https://repo.anaconda.com/pkgs/main/noarch 2025-05-07T20:24:21.7081932Z https://repo.anaconda.com/pkgs/r/linux-64 2025-05-07T20:24:21.7082307Z https://repo.anaconda.com/pkgs/r/noarch 2025-05-07T20:24:21.7082668Z package cache : /home/ec2-user/miniconda/pkgs 2025-05-07T20:24:21.7083148Z /home/ec2-user/.conda/pkgs 2025-05-07T20:24:21.7083484Z envs directories : /home/ec2-user/miniconda/envs 2025-05-07T20:24:21.7083815Z /home/ec2-user/.conda/envs 2025-05-07T20:24:21.7084117Z platform : linux-64 2025-05-07T20:24:21.7084927Z user-agent : conda/25.3.1 requests/2.32.3 CPython/3.13.2 Linux/6.1.130-139.222.amzn2023.x86_64 amzn/2023.6.20250317 glibc/2.34 solver/libmamba conda-libmamba-solver/25.4.0 libmambapy/2.0.5 aau/0.7.0 c/. s/. e/. 2025-05-07T20:24:21.7085737Z UID:GID : 1000:1000 2025-05-07T20:24:21.7086012Z netrc file : None 2025-05-07T20:24:21.7086261Z offline mode : False 2025-05-07T20:24:21.7086430Z 2025-05-07T20:24:21.7756740Z 2025-05-07T20:24:21.7757178Z [SETUP] Exporting Miniconda variables ... 2025-05-07T20:24:21.7758126Z [SETUP] Saving Miniconda variables to /home/ec2-user/actions-runner/_work/_temp/_runner_file_commands/add_path_2db73161-039f-40ef-95f2-00b3cc70e8e1 ... 2025-05-07T20:24:21.7758917Z [SETUP] Successfully set up Miniconda at /home/ec2-user/miniconda 2025-05-07T20:24:21.7840890Z ##[group]Run . $PRELUDE; create_conda_environment $BUILD_ENV 3.13 2025-05-07T20:24:21.7841373Z . 
$PRELUDE; create_conda_environment $BUILD_ENV 3.13 2025-05-07T20:24:21.7857411Z shell: /usr/bin/bash --noprofile --norc -e -o pipefail {0} 2025-05-07T20:24:21.7857794Z env: 2025-05-07T20:24:21.7858015Z PRELUDE: .github/scripts/setup_env.bash 2025-05-07T20:24:21.7858309Z BUILD_ENV: build_binary 2025-05-07T20:24:21.7858550Z BUILD_TARGET: genai 2025-05-07T20:24:21.7858782Z BUILD_VARIANT: cuda 2025-05-07T20:24:21.7859010Z BUILD_CUDA_VERSION: 12.6.3 2025-05-07T20:24:21.7859251Z ENFORCE_CUDA_DEVICE: 1 2025-05-07T20:24:21.7859551Z GPU_FLAG: --gpus all -e NVIDIA_DRIVER_CAPABILITIES=all 2025-05-07T20:24:21.7859876Z ##[endgroup] 2025-05-07T20:24:22.1236529Z ################################################################################ 2025-05-07T20:24:22.1237063Z # Create Conda Environment 2025-05-07T20:24:22.1237453Z # 2025-05-07T20:24:22.1251820Z # [2025-05-07T20:24:22.124Z] + create_conda_environment build_binary 3.13 2025-05-07T20:24:22.1252314Z ################################################################################ 2025-05-07T20:24:22.1252539Z 2025-05-07T20:24:22.1266970Z [EXEC] [ATTEMPT 0/3] + wget -q --timeout 1 pypi.org -O /dev/null 2025-05-07T20:24:22.2202418Z [CHECK] Network does not appear to be blocked. 2025-05-07T20:24:22.2202778Z [SETUP] Listing existing Conda environments ... 2025-05-07T20:24:22.2203106Z + conda info --envs 2025-05-07T20:24:22.2203251Z 2025-05-07T20:24:22.9727630Z 2025-05-07T20:24:22.9728172Z # conda environments: 2025-05-07T20:24:22.9728451Z # 2025-05-07T20:24:22.9728672Z base /home/ec2-user/miniconda 2025-05-07T20:24:22.9728889Z 2025-05-07T20:24:23.0406364Z 2025-05-07T20:24:23.0407268Z [SETUP] Deleting the prefix directory if it exists ... 2025-05-07T20:24:24.6984927Z + rm -rf /home/ec2-user/miniconda/envs/build_binary 2025-05-07T20:24:24.6985234Z 2025-05-07T20:24:24.6998487Z 2025-05-07T20:24:24.7007927Z [SETUP] Creating new Conda environment (Python 3.13) ... 
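[NOTE] Every network-bound step in this log is wrapped in a retry harness (the recurring "[EXEC] [ATTEMPT 0/3]" prefix). The wrapper below is a hypothetical reconstruction: only the attempt-count convention and the echoed command format come from the log; the function name and the backoff are assumptions.

    # Hypothetical retry helper mirroring the "[EXEC] [ATTEMPT i/3]" lines.
    exec_with_retries () {
      local max_attempts=3
      for ((i = 0; i < max_attempts; i++)); do
        echo "[EXEC] [ATTEMPT $i/$max_attempts] + $*"
        "$@" && return 0
        sleep $((2 ** i))    # assumed backoff; the log does not show one
      done
      echo "[EXEC] Command failed after $max_attempts attempts: $*" >&2
      return 1
    }
    exec_with_retries conda create -y -n build_binary python=3.13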
2025-05-07T20:24:24.7031020Z [EXEC] [ATTEMPT 0/3] + conda create -y -n build_binary python=3.13 2025-05-07T20:24:25.4590957Z Channels: 2025-05-07T20:24:25.4591202Z - defaults 2025-05-07T20:24:25.4591852Z Platform: linux-64 2025-05-07T20:24:27.0486725Z Collecting package metadata (repodata.json): - \ | / - \ | / - \ | done 2025-05-07T20:24:27.1490922Z Solving environment: - done 2025-05-07T20:24:27.1781593Z 2025-05-07T20:24:27.1781858Z ## Package Plan ## 2025-05-07T20:24:27.1782020Z 2025-05-07T20:24:27.1782247Z environment location: /home/ec2-user/miniconda/envs/build_binary 2025-05-07T20:24:27.1782594Z 2025-05-07T20:24:27.1782714Z added / updated specs: 2025-05-07T20:24:27.1782960Z - python=3.13 2025-05-07T20:24:27.1783092Z 2025-05-07T20:24:27.1783096Z 2025-05-07T20:24:27.1783214Z The following packages will be downloaded: 2025-05-07T20:24:27.1783856Z 2025-05-07T20:24:27.1783983Z package | build 2025-05-07T20:24:27.1784301Z ---------------------------|----------------- 2025-05-07T20:24:27.1784658Z _libgcc_mutex-0.1 | main 3 KB 2025-05-07T20:24:27.1785050Z _openmp_mutex-5.1 | 1_gnu 21 KB 2025-05-07T20:24:27.1785458Z ca-certificates-2025.2.25 | h06a4308_0 129 KB 2025-05-07T20:24:27.1785866Z python_abi-3.13 | 0_cp313 6 KB 2025-05-07T20:24:27.1786223Z ------------------------------------------------------------ 2025-05-07T20:24:27.1786558Z Total: 159 KB 2025-05-07T20:24:27.1786762Z 2025-05-07T20:24:27.1786893Z The following NEW packages will be INSTALLED: 2025-05-07T20:24:27.1787107Z 2025-05-07T20:24:27.1787306Z _libgcc_mutex pkgs/main/linux-64::_libgcc_mutex-0.1-main 2025-05-07T20:24:27.1787744Z _openmp_mutex pkgs/main/linux-64::_openmp_mutex-5.1-1_gnu 2025-05-07T20:24:27.1788374Z bzip2 pkgs/main/linux-64::bzip2-1.0.8-h5eee18b_6 2025-05-07T20:24:27.1788850Z ca-certificates pkgs/main/linux-64::ca-certificates-2025.2.25-h06a4308_0 2025-05-07T20:24:27.1789317Z expat pkgs/main/linux-64::expat-2.7.1-h6a678d5_0 2025-05-07T20:24:27.1789755Z ld_impl_linux-64 pkgs/main/linux-64::ld_impl_linux-64-2.40-h12ee557_0 2025-05-07T20:24:27.1790201Z libffi pkgs/main/linux-64::libffi-3.4.4-h6a678d5_1 2025-05-07T20:24:27.1790623Z libgcc-ng pkgs/main/linux-64::libgcc-ng-11.2.0-h1234567_1 2025-05-07T20:24:27.1791045Z libgomp pkgs/main/linux-64::libgomp-11.2.0-h1234567_1 2025-05-07T20:24:27.1791468Z libmpdec pkgs/main/linux-64::libmpdec-4.0.0-h5eee18b_0 2025-05-07T20:24:27.1791919Z libstdcxx-ng pkgs/main/linux-64::libstdcxx-ng-11.2.0-h1234567_1 2025-05-07T20:24:27.1792364Z libuuid pkgs/main/linux-64::libuuid-1.41.5-h5eee18b_0 2025-05-07T20:24:27.1792778Z ncurses pkgs/main/linux-64::ncurses-6.4-h6a678d5_0 2025-05-07T20:24:27.1793190Z openssl pkgs/main/linux-64::openssl-3.0.16-h5eee18b_0 2025-05-07T20:24:27.1793588Z pip pkgs/main/noarch::pip-25.1-pyhc872135_2 2025-05-07T20:24:27.1793997Z python pkgs/main/linux-64::python-3.13.2-hf623796_100_cp313 2025-05-07T20:24:27.1794425Z python_abi pkgs/main/linux-64::python_abi-3.13-0_cp313 2025-05-07T20:24:27.1794844Z readline pkgs/main/linux-64::readline-8.2-h5eee18b_0 2025-05-07T20:24:27.1795305Z setuptools pkgs/main/linux-64::setuptools-78.1.1-py313h06a4308_0 2025-05-07T20:24:27.1795758Z sqlite pkgs/main/linux-64::sqlite-3.45.3-h5eee18b_0 2025-05-07T20:24:27.1796140Z tk pkgs/main/linux-64::tk-8.6.14-h39e8969_0 2025-05-07T20:24:27.1796511Z tzdata pkgs/main/noarch::tzdata-2025b-h04d1e81_0 2025-05-07T20:24:27.1796924Z wheel pkgs/main/linux-64::wheel-0.45.1-py313h06a4308_0 2025-05-07T20:24:27.1797307Z xz pkgs/main/linux-64::xz-5.6.4-h5eee18b_1 2025-05-07T20:24:27.1797669Z zlib 
pkgs/main/linux-64::zlib-1.2.13-h5eee18b_1
2025-05-07T20:24:27.1798052Z Downloading and Extracting Packages: ...working...
2025-05-07T20:24:27.2413840Z python_abi-3.13 | 6 KB | ########## | 100%
2025-05-07T20:24:27.2430193Z ca-certificates-2025 | 129 KB | ########## | 100%
2025-05-07T20:24:27.2524075Z _libgcc_mutex-0.1 | 3 KB | ########## | 100%
2025-05-07T20:24:27.2529926Z _openmp_mutex-5.1 | 21 KB | ########## | 100%
2025-05-07T20:24:27.2532389Z done
2025-05-07T20:24:27.4638625Z Preparing transaction: done
2025-05-07T20:24:28.8900758Z Verifying transaction: done
2025-05-07T20:24:31.3082615Z Executing transaction: done
2025-05-07T20:24:31.3584987Z #
2025-05-07T20:24:31.3585334Z # To activate this environment, use
2025-05-07T20:24:31.3585738Z #
2025-05-07T20:24:31.3586019Z # $ conda activate build_binary
2025-05-07T20:24:31.3586434Z #
2025-05-07T20:24:31.3586738Z # To deactivate an active environment, use
2025-05-07T20:24:31.3587146Z #
2025-05-07T20:24:31.3587405Z # $ conda deactivate
2025-05-07T20:24:31.4684639Z [SETUP] Upgrading PIP to latest ...
2025-05-07T20:24:31.4706335Z [EXEC] [ATTEMPT 0/3] + conda run -n build_binary pip install --upgrade pip
2025-05-07T20:24:34.3308876Z Requirement already satisfied: pip in /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages (25.1)
2025-05-07T20:24:34.3310423Z Collecting pip
2025-05-07T20:24:34.3310750Z Using cached pip-25.1.1-py3-none-any.whl.metadata (3.6 kB)
2025-05-07T20:24:34.3311151Z Using cached pip-25.1.1-py3-none-any.whl (1.8 MB)
2025-05-07T20:24:34.3311499Z Installing collected packages: pip
2025-05-07T20:24:34.3311791Z Attempting uninstall: pip
2025-05-07T20:24:34.3312061Z Found existing installation: pip 25.1
2025-05-07T20:24:34.3312368Z Uninstalling pip-25.1:
2025-05-07T20:24:34.3312665Z Successfully uninstalled pip-25.1
2025-05-07T20:24:34.3312974Z Successfully installed pip-25.1.1
2025-05-07T20:24:34.3953524Z [SETUP] Upgrading pyOpenSSL ...
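[NOTE] The step below passes the version spec pyOpenSSL>22.1.0. When typed into a shell directly, the ">" must be quoted or it is parsed as output redirection instead of a version constraint; a minimal sketch of the safe form, reusing the environment name from this log:

    # Quote the spec so ">" reaches conda instead of the shell.
    conda install -n build_binary -c conda-forge --override-channels -y "pyOpenSSL>22.1.0"
    # Verify the import the same way the log's [CHECK] step does:
    conda run -n build_binary python -c "import OpenSSL; print(OpenSSL.__version__)"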
2025-05-07T20:24:34.3975671Z [EXEC] [ATTEMPT 0/3] + conda install -n build_binary -c conda-forge --override-channels -y pyOpenSSL>22.1.0 2025-05-07T20:24:35.2519654Z Channels: 2025-05-07T20:24:35.2519910Z - conda-forge 2025-05-07T20:24:35.2520144Z Platform: linux-64 2025-05-07T20:24:45.8507389Z Collecting package metadata (repodata.json): - \ | / - \ | / - \ | / - \ | / - done 2025-05-07T20:24:47.5474263Z Solving environment: | / - \ | / done 2025-05-07T20:24:47.6117874Z 2025-05-07T20:24:47.6118336Z ## Package Plan ## 2025-05-07T20:24:47.6118567Z 2025-05-07T20:24:47.6118855Z environment location: /home/ec2-user/miniconda/envs/build_binary 2025-05-07T20:24:47.6119276Z 2025-05-07T20:24:47.6119403Z added / updated specs: 2025-05-07T20:24:47.6120217Z - pyopenssl[version='>22.1.0'] 2025-05-07T20:24:47.6120443Z 2025-05-07T20:24:47.6120448Z 2025-05-07T20:24:47.6120584Z The following packages will be downloaded: 2025-05-07T20:24:47.6120793Z 2025-05-07T20:24:47.6120909Z package | build 2025-05-07T20:24:47.6121226Z ---------------------------|----------------- 2025-05-07T20:24:47.6121589Z cffi-1.17.1 | py313hfab6e84_0 289 KB conda-forge 2025-05-07T20:24:47.6122025Z cryptography-44.0.3 | py313h6556f6e_0 1.5 MB conda-forge 2025-05-07T20:24:47.6122566Z libgcc-15.1.0 | h767d61c_2 810 KB conda-forge 2025-05-07T20:24:47.6123073Z libgcc-ng-15.1.0 | h69a702a_2 34 KB conda-forge 2025-05-07T20:24:47.6123483Z libgomp-15.1.0 | h767d61c_2 442 KB conda-forge 2025-05-07T20:24:47.6123877Z openssl-3.5.0 | h7b32b05_1 3.0 MB conda-forge 2025-05-07T20:24:47.6124313Z pycparser-2.22 | pyh29332c3_1 108 KB conda-forge 2025-05-07T20:24:47.6124931Z pyopenssl-25.0.0 | pyhd8ed1ab_0 120 KB conda-forge 2025-05-07T20:24:47.6125402Z typing-extensions-4.13.2 | h0e9735f_0 88 KB conda-forge 2025-05-07T20:24:47.6125875Z typing_extensions-4.13.2 | pyh29332c3_0 51 KB conda-forge 2025-05-07T20:24:47.6126282Z ------------------------------------------------------------ 2025-05-07T20:24:47.6126623Z Total: 6.4 MB 2025-05-07T20:24:47.6126838Z 2025-05-07T20:24:47.6126964Z The following NEW packages will be INSTALLED: 2025-05-07T20:24:47.6127178Z 2025-05-07T20:24:47.6127380Z cffi conda-forge/linux-64::cffi-1.17.1-py313hfab6e84_0 2025-05-07T20:24:47.6127862Z cryptography conda-forge/linux-64::cryptography-44.0.3-py313h6556f6e_0 2025-05-07T20:24:47.6128349Z libgcc conda-forge/linux-64::libgcc-15.1.0-h767d61c_2 2025-05-07T20:24:47.6129066Z pycparser conda-forge/noarch::pycparser-2.22-pyh29332c3_1 2025-05-07T20:24:47.6129699Z pyopenssl conda-forge/noarch::pyopenssl-25.0.0-pyhd8ed1ab_0 2025-05-07T20:24:47.6130444Z typing-extensions conda-forge/noarch::typing-extensions-4.13.2-h0e9735f_0 2025-05-07T20:24:47.6131306Z typing_extensions conda-forge/noarch::typing_extensions-4.13.2-pyh29332c3_0 2025-05-07T20:24:47.6131821Z 2025-05-07T20:24:47.6131995Z The following packages will be UPDATED: 2025-05-07T20:24:47.6132328Z 2025-05-07T20:24:47.6132937Z ca-certificates pkgs/main/linux-64::ca-certificates-2~ --> conda-forge/noarch::ca-certificates-2025.4.26-hbd8a1cb_0 2025-05-07T20:24:47.6134192Z libgcc-ng pkgs/main::libgcc-ng-11.2.0-h1234567_1 --> conda-forge::libgcc-ng-15.1.0-h69a702a_2 2025-05-07T20:24:47.6135120Z libgomp pkgs/main::libgomp-11.2.0-h1234567_1 --> conda-forge::libgomp-15.1.0-h767d61c_2 2025-05-07T20:24:47.6135806Z openssl pkgs/main::openssl-3.0.16-h5eee18b_0 --> conda-forge::openssl-3.5.0-h7b32b05_1 2025-05-07T20:24:47.6136173Z 2025-05-07T20:24:47.6136182Z 2025-05-07T20:24:47.6136186Z 2025-05-07T20:24:47.6136335Z Downloading and Extracting 
Packages: ...working...
2025-05-07T20:24:47.8383394Z libgomp-15.1.0 | 442 KB | ########## | 100%
2025-05-07T20:24:47.8855517Z libgcc-15.1.0 | 810 KB | ########## | 100%
2025-05-07T20:24:47.9564736Z pyopenssl-25.0.0 | 120 KB | ########## | 100%
2025-05-07T20:24:47.9582598Z cffi-1.17.1 | 289 KB | ########## | 100%
2025-05-07T20:24:47.9738616Z pycparser-2.22 | 108 KB | ########## | 100%
2025-05-07T20:24:48.0070127Z typing_extensions-4. | 51 KB | ########## | 100%
2025-05-07T20:24:48.0274177Z libgcc-ng-15.1.0 | 34 KB | ########## | 100%
2025-05-07T20:24:48.0673876Z typing-extensions-4. | 88 KB | ########## | 100%
2025-05-07T20:24:48.1020547Z cryptography-44.0.3 | 1.5 MB | ########## | 100%
2025-05-07T20:24:48.1026829Z openssl-3.5.0 | 3.0 MB | ########## | 100%
2025-05-07T20:24:48.1031201Z done
2025-05-07T20:24:48.2041730Z Preparing transaction: done
2025-05-07T20:24:48.3046957Z Verifying transaction: done
2025-05-07T20:24:49.8071688Z Executing transaction: done
2025-05-07T20:24:49.9862751Z [SETUP] Testing pyOpenSSL import ...
2025-05-07T20:24:51.7208595Z [CHECK] Python (sub-)package 'OpenSSL' found ...
2025-05-07T20:24:51.7221885Z [SETUP] Installing libxcrypt ...
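[NOTE] The libxcrypt step below exists because modern glibc no longer ships libcrypt development files, while some Python extension builds still expect crypt.h to be reachable from the Python headers; the log resolves this by installing libxcrypt from conda-forge and copying its header into the interpreter's include directory. A sketch of the same fix, with paths taken from the log:

    conda install -n build_binary -c conda-forge --override-channels -y libxcrypt
    ENV_PREFIX="$HOME/miniconda/envs/build_binary"
    # Make crypt.h visible to builds that include it via the Python headers.
    cp "$ENV_PREFIX/include/crypt.h" "$ENV_PREFIX/include/python3.13/crypt.h"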
2025-05-07T20:24:51.7244614Z [EXEC] [ATTEMPT 0/3] + conda install -n build_binary -c conda-forge --override-channels -y libxcrypt 2025-05-07T20:24:52.5903032Z Channels: 2025-05-07T20:24:52.5903357Z - conda-forge 2025-05-07T20:24:52.5903661Z Platform: linux-64 2025-05-07T20:24:55.8913022Z Collecting package metadata (repodata.json): - \ | / done 2025-05-07T20:24:56.2592939Z Solving environment: \ done 2025-05-07T20:24:56.3216640Z 2025-05-07T20:24:56.3217185Z ## Package Plan ## 2025-05-07T20:24:56.3217392Z 2025-05-07T20:24:56.3217607Z environment location: /home/ec2-user/miniconda/envs/build_binary 2025-05-07T20:24:56.3217904Z 2025-05-07T20:24:56.3217999Z added / updated specs: 2025-05-07T20:24:56.3218248Z - libxcrypt 2025-05-07T20:24:56.3218381Z 2025-05-07T20:24:56.3218386Z 2025-05-07T20:24:56.3218510Z The following packages will be downloaded: 2025-05-07T20:24:56.3218724Z 2025-05-07T20:24:56.3218847Z package | build 2025-05-07T20:24:56.3219161Z ---------------------------|----------------- 2025-05-07T20:24:56.3219537Z libxcrypt-4.4.36 | hd590300_1 98 KB conda-forge 2025-05-07T20:24:56.3219940Z ------------------------------------------------------------ 2025-05-07T20:24:56.3220302Z Total: 98 KB 2025-05-07T20:24:56.3220521Z 2025-05-07T20:24:56.3220656Z The following NEW packages will be INSTALLED: 2025-05-07T20:24:56.3220878Z 2025-05-07T20:24:56.3221096Z libxcrypt conda-forge/linux-64::libxcrypt-4.4.36-hd590300_1 2025-05-07T20:24:56.3221376Z 2025-05-07T20:24:56.3221380Z 2025-05-07T20:24:56.3221384Z 2025-05-07T20:24:56.3221537Z Downloading and Extracting Packages: ...working... 2025-05-07T20:24:56.4852687Z libxcrypt-4.4.36 | 98 KB | | 0% 2025-05-07T20:24:56.4879501Z libxcrypt-4.4.36 | 98 KB | #6 | 16% 2025-05-07T20:24:56.4982215Z libxcrypt-4.4.36 | 98 KB | ########## | 100% 2025-05-07T20:24:56.4984722Z libxcrypt-4.4.36 | 98 KB | ########## | 100% 2025-05-07T20:24:56.4985198Z 2025-05-07T20:24:56.4985493Z done 2025-05-07T20:24:56.5988550Z Preparing transaction: / done 2025-05-07T20:24:56.6993963Z Verifying transaction: \ done 2025-05-07T20:24:56.8001140Z Executing transaction: / done 2025-05-07T20:25:00.2593053Z [SETUP] Copying over ... 2025-05-07T20:25:00.2593762Z + cp /home/ec2-user/miniconda/envs/build_binary/include/crypt.h /home/ec2-user/miniconda/envs/build_binary/include/python3.13/crypt.h 2025-05-07T20:25:00.2594290Z 2025-05-07T20:25:00.2623556Z 2025-05-07T20:25:01.9166359Z [SETUP] Installed Python version: Python 3.13.2 2025-05-07T20:25:01.9167492Z [SETUP] Successfully created Conda environment: build_binary 2025-05-07T20:25:01.9199121Z ##[group]Run . $PRELUDE; install_cxx_compiler $BUILD_ENV gcc 2025-05-07T20:25:01.9199726Z . 
$PRELUDE; install_cxx_compiler $BUILD_ENV gcc 2025-05-07T20:25:01.9212416Z shell: /usr/bin/bash --noprofile --norc -e -o pipefail {0} 2025-05-07T20:25:01.9212763Z env: 2025-05-07T20:25:01.9212982Z PRELUDE: .github/scripts/setup_env.bash 2025-05-07T20:25:01.9213277Z BUILD_ENV: build_binary 2025-05-07T20:25:01.9213536Z BUILD_TARGET: genai 2025-05-07T20:25:01.9214072Z BUILD_VARIANT: cuda 2025-05-07T20:25:01.9214295Z BUILD_CUDA_VERSION: 12.6.3 2025-05-07T20:25:01.9214547Z ENFORCE_CUDA_DEVICE: 1 2025-05-07T20:25:01.9214844Z GPU_FLAG: --gpus all -e NVIDIA_DRIVER_CAPABILITIES=all 2025-05-07T20:25:01.9215163Z ##[endgroup] 2025-05-07T20:25:02.2607434Z ################################################################################ 2025-05-07T20:25:02.2607920Z # Install C/C++ Compilers 2025-05-07T20:25:02.2608189Z # 2025-05-07T20:25:02.2624631Z # [2025-05-07T20:25:02.262Z] + install_cxx_compiler build_binary gcc 2025-05-07T20:25:02.2625164Z ################################################################################ 2025-05-07T20:25:02.2625465Z 2025-05-07T20:25:02.2642113Z [EXEC] [ATTEMPT 0/3] + wget -q --timeout 1 pypi.org -O /dev/null 2025-05-07T20:25:02.3567841Z [CHECK] Network does not appear to be blocked. 2025-05-07T20:25:02.3578553Z [INSTALL] Installing GLIBC (architecture = 64) ... 2025-05-07T20:25:02.3601036Z [EXEC] [ATTEMPT 0/3] + conda install -n build_binary -c conda-forge --override-channels -y sysroot_linux-64=2.17 2025-05-07T20:25:03.2238371Z Channels: 2025-05-07T20:25:03.2238664Z - conda-forge 2025-05-07T20:25:03.2238924Z Platform: linux-64 2025-05-07T20:25:06.5706005Z Collecting package metadata (repodata.json): - \ | / done 2025-05-07T20:25:06.9410085Z Solving environment: \ done 2025-05-07T20:25:07.0040216Z 2025-05-07T20:25:07.0040735Z ## Package Plan ## 2025-05-07T20:25:07.0040910Z 2025-05-07T20:25:07.0041147Z environment location: /home/ec2-user/miniconda/envs/build_binary 2025-05-07T20:25:07.0041463Z 2025-05-07T20:25:07.0041562Z added / updated specs: 2025-05-07T20:25:07.0041823Z - sysroot_linux-64=2.17 2025-05-07T20:25:07.0041989Z 2025-05-07T20:25:07.0041993Z 2025-05-07T20:25:07.0042120Z The following packages will be downloaded: 2025-05-07T20:25:07.0042330Z 2025-05-07T20:25:07.0042446Z package | build 2025-05-07T20:25:07.0042764Z ---------------------------|----------------- 2025-05-07T20:25:07.0043189Z kernel-headers_linux-64-3.10.0| he073ed8_18 921 KB conda-forge 2025-05-07T20:25:07.0043660Z sysroot_linux-64-2.17 | h0157908_18 14.5 MB conda-forge 2025-05-07T20:25:07.0044060Z ------------------------------------------------------------ 2025-05-07T20:25:07.0044393Z Total: 15.4 MB 2025-05-07T20:25:07.0044595Z 2025-05-07T20:25:07.0044722Z The following NEW packages will be INSTALLED: 2025-05-07T20:25:07.0044946Z 2025-05-07T20:25:07.0045221Z kernel-headers_li~ conda-forge/noarch::kernel-headers_linux-64-3.10.0-he073ed8_18 2025-05-07T20:25:07.0045767Z sysroot_linux-64 conda-forge/noarch::sysroot_linux-64-2.17-h0157908_18 2025-05-07T20:25:07.0046070Z 2025-05-07T20:25:07.0046074Z 2025-05-07T20:25:07.0046078Z 2025-05-07T20:25:07.0046218Z Downloading and Extracting Packages: ...working... 
2025-05-07T20:25:07.9056448Z kernel-headers_linux | 921 KB | ########## | 100%
2025-05-07T20:25:07.9060201Z sysroot_linux-64-2.1 | 14.5 MB | ########## | 100%
2025-05-07T20:25:07.9061529Z done
2025-05-07T20:25:08.0065362Z Preparing transaction: done
2025-05-07T20:25:08.2071640Z Verifying transaction: done
2025-05-07T20:25:08.4131059Z Executing transaction: done
2025-05-07T20:25:08.5705281Z [CHECK] LD_LIBRARY_PATH =
2025-05-07T20:25:08.5705707Z [CHECK] CONDA_PREFIX is not set.
2025-05-07T20:25:10.2769666Z [CHECK] libstdc++.so.6 found in CONDA_PREFIX PATH (symbolic link): /home/ec2-user/miniconda/envs/build_binary/lib/libstdc++.so.6
2025-05-07T20:25:10.2782856Z [INSTALL] Installing GCC (11.4.0, 64) through Conda ...
2025-05-07T20:25:10.2805278Z [EXEC] [ATTEMPT 0/3] + conda install -n build_binary -c conda-forge --override-channels -y gxx_linux-64=11.4.0
2025-05-07T20:25:11.1688221Z Channels:
2025-05-07T20:25:11.1688481Z - conda-forge
2025-05-07T20:25:11.1688720Z Platform: linux-64
2025-05-07T20:25:14.4785684Z Collecting package metadata (repodata.json): done
2025-05-07T20:25:15.4437234Z Solving environment: done
2025-05-07T20:25:15.5083247Z ## Package Plan ##
2025-05-07T20:25:15.5083634Z environment location: /home/ec2-user/miniconda/envs/build_binary
2025-05-07T20:25:15.5084058Z added / updated specs:
2025-05-07T20:25:15.5084313Z - gxx_linux-64=11.4.0
2025-05-07T20:25:15.5084594Z The following packages will be downloaded:
2025-05-07T20:25:15.5084948Z package | build
2025-05-07T20:25:15.5085255Z ---------------------------|-----------------
2025-05-07T20:25:15.5085647Z binutils_impl_linux-64-2.40| ha1999f0_7 6.0 MB conda-forge
2025-05-07T20:25:15.5086122Z binutils_linux-64-2.40 | hb3c18ed_4 28 KB conda-forge
2025-05-07T20:25:15.5086578Z gcc_impl_linux-64-11.4.0 | h00c12a0_13 53.0 MB conda-forge
2025-05-07T20:25:15.5087012Z gcc_linux-64-11.4.0 | ha077dfb_4 31 KB conda-forge
2025-05-07T20:25:15.5087439Z gxx_impl_linux-64-11.4.0 | h634f3ee_13 11.2 MB conda-forge
2025-05-07T20:25:15.5087863Z gxx_linux-64-11.4.0 | h35bfe5d_4 29 KB conda-forge
2025-05-07T20:25:15.5088273Z ld_impl_linux-64-2.40 | hf3520f5_7 691 KB conda-forge
2025-05-07T20:25:15.5088729Z libgcc-devel_linux-64-11.4.0| h8f596e0_113 2.3 MB conda-forge
2025-05-07T20:25:15.5089195Z libsanitizer-11.4.0 | h5763a12_13 3.5 MB conda-forge
2025-05-07T20:25:15.5089626Z libstdcxx-15.1.0 | h8f9b012_2 3.7 MB conda-forge
2025-05-07T20:25:15.5090084Z libstdcxx-devel_linux-64-11.4.0| h8f596e0_113 11.1 MB conda-forge
2025-05-07T20:25:15.5090553Z libstdcxx-ng-15.1.0 |
h4852527_2 34 KB conda-forge 2025-05-07T20:25:15.5090956Z ------------------------------------------------------------ 2025-05-07T20:25:15.5091290Z Total: 91.6 MB 2025-05-07T20:25:15.5091509Z 2025-05-07T20:25:15.5091637Z The following NEW packages will be INSTALLED: 2025-05-07T20:25:15.5091856Z 2025-05-07T20:25:15.5092121Z binutils_impl_lin~ conda-forge/linux-64::binutils_impl_linux-64-2.40-ha1999f0_7 2025-05-07T20:25:15.5092662Z binutils_linux-64 conda-forge/linux-64::binutils_linux-64-2.40-hb3c18ed_4 2025-05-07T20:25:15.5093538Z gcc_impl_linux-64 conda-forge/linux-64::gcc_impl_linux-64-11.4.0-h00c12a0_13 2025-05-07T20:25:15.5094168Z gcc_linux-64 conda-forge/linux-64::gcc_linux-64-11.4.0-ha077dfb_4 2025-05-07T20:25:15.5094655Z gxx_impl_linux-64 conda-forge/linux-64::gxx_impl_linux-64-11.4.0-h634f3ee_13 2025-05-07T20:25:15.5095145Z gxx_linux-64 conda-forge/linux-64::gxx_linux-64-11.4.0-h35bfe5d_4 2025-05-07T20:25:15.5095655Z libgcc-devel_linu~ conda-forge/noarch::libgcc-devel_linux-64-11.4.0-h8f596e0_113 2025-05-07T20:25:15.5096364Z libsanitizer conda-forge/linux-64::libsanitizer-11.4.0-h5763a12_13 2025-05-07T20:25:15.5096850Z libstdcxx conda-forge/linux-64::libstdcxx-15.1.0-h8f9b012_2 2025-05-07T20:25:15.5097375Z libstdcxx-devel_l~ conda-forge/noarch::libstdcxx-devel_linux-64-11.4.0-h8f596e0_113 2025-05-07T20:25:15.5097724Z 2025-05-07T20:25:15.5097835Z The following packages will be UPDATED: 2025-05-07T20:25:15.5098041Z 2025-05-07T20:25:15.5098524Z ld_impl_linux-64 pkgs/main::ld_impl_linux-64-2.40-h12e~ --> conda-forge::ld_impl_linux-64-2.40-hf3520f5_7 2025-05-07T20:25:15.5099218Z libstdcxx-ng pkgs/main::libstdcxx-ng-11.2.0-h12345~ --> conda-forge::libstdcxx-ng-15.1.0-h4852527_2 2025-05-07T20:25:15.5099608Z 2025-05-07T20:25:15.5099612Z 2025-05-07T20:25:15.5099616Z 2025-05-07T20:25:15.5099760Z Downloading and Extracting Packages: ...working... 
2025-05-07T20:25:15.8778142Z libstdcxx-15.1.0 | 3.7 MB | ########## | 100%
2025-05-07T20:25:16.0013786Z binutils_impl_linux- | 6.0 MB | ########## | 100%
2025-05-07T20:25:16.2077933Z libgcc-devel_linux-6 | 2.3 MB | ########## | 100%
2025-05-07T20:25:16.2464520Z libsanitizer-11.4.0 | 3.5 MB | ########## | 100%
2025-05-07T20:25:16.2890490Z libstdcxx-ng-15.1.0 | 34 KB | ########## | 100%
2025-05-07T20:25:16.3362707Z ld_impl_linux-64-2.4 | 691 KB | ########## | 100%
2025-05-07T20:25:16.3466989Z gxx_linux-64-11.4.0 | 29 KB | ########## | 100%
2025-05-07T20:25:16.3785876Z gcc_linux-64-11.4.0 | 31 KB | ########## | 100%
2025-05-07T20:25:16.4057844Z binutils_linux-64-2. | 28 KB | ########## | 100%
2025-05-07T20:25:16.5223169Z gxx_impl_linux-64-11 | 11.2 MB | ########## | 100%
2025-05-07T20:25:16.5470887Z libstdcxx-devel_linu | 11.1 MB | ########## | 100%
2025-05-07T20:25:16.8572270Z gcc_impl_linux-64-11 | 53.0 MB | ########9 | 90%
2025-05-07T20:25:17.0085923Z binutils_linux-64-2.
| 28 KB | ########## | 100%  2025-05-07T20:25:17.0086238Z 2025-05-07T20:25:17.0086242Z 2025-05-07T20:25:17.0086246Z 2025-05-07T20:25:17.1869376Z binutils_impl_linux- | 6.0 MB | ########## | 100%  2025-05-07T20:25:17.1870157Z gcc_impl_linux-64-11 | 53.0 MB | ########## | 100% 2025-05-07T20:25:17.2394437Z gcc_impl_linux-64-11 | 53.0 MB | ########## | 100% 2025-05-07T20:25:17.2394705Z 2025-05-07T20:25:17.4903107Z gxx_impl_linux-64-11 | 11.2 MB | ########## | 100%  2025-05-07T20:25:17.4903385Z 2025-05-07T20:25:17.4903390Z 2025-05-07T20:25:17.9128394Z libstdcxx-devel_linu | 11.1 MB | ########## | 100%  2025-05-07T20:25:17.9135587Z gcc_impl_linux-64-11 | 53.0 MB | ########## | 100% 2025-05-07T20:25:17.9135941Z 2025-05-07T20:25:17.9136160Z 2025-05-07T20:25:17.9136460Z  2025-05-07T20:25:17.9136723Z 2025-05-07T20:25:17.9136727Z 2025-05-07T20:25:17.9136905Z  2025-05-07T20:25:17.9137191Z 2025-05-07T20:25:17.9137197Z 2025-05-07T20:25:17.9137203Z 2025-05-07T20:25:17.9137391Z  2025-05-07T20:25:17.9137601Z 2025-05-07T20:25:17.9137617Z 2025-05-07T20:25:17.9137621Z 2025-05-07T20:25:17.9137625Z 2025-05-07T20:25:17.9137798Z  2025-05-07T20:25:17.9138010Z 2025-05-07T20:25:17.9138014Z 2025-05-07T20:25:17.9138018Z 2025-05-07T20:25:17.9138021Z 2025-05-07T20:25:17.9138025Z 2025-05-07T20:25:17.9138211Z  2025-05-07T20:25:17.9138425Z 2025-05-07T20:25:17.9138429Z 2025-05-07T20:25:17.9138433Z 2025-05-07T20:25:17.9138436Z 2025-05-07T20:25:17.9138684Z 2025-05-07T20:25:17.9138689Z 2025-05-07T20:25:17.9138874Z  2025-05-07T20:25:17.9139095Z 2025-05-07T20:25:17.9139099Z 2025-05-07T20:25:17.9139102Z 2025-05-07T20:25:17.9139106Z 2025-05-07T20:25:17.9139110Z 2025-05-07T20:25:17.9139113Z 2025-05-07T20:25:17.9139117Z 2025-05-07T20:25:17.9139298Z  2025-05-07T20:25:17.9139670Z 2025-05-07T20:25:17.9139673Z 2025-05-07T20:25:17.9139677Z 2025-05-07T20:25:17.9139680Z 2025-05-07T20:25:17.9139684Z 2025-05-07T20:25:17.9139687Z 2025-05-07T20:25:17.9139691Z 2025-05-07T20:25:17.9139694Z 2025-05-07T20:25:17.9139877Z  2025-05-07T20:25:17.9140102Z 2025-05-07T20:25:17.9140105Z 2025-05-07T20:25:17.9140109Z 2025-05-07T20:25:17.9140112Z 2025-05-07T20:25:17.9140116Z 2025-05-07T20:25:17.9140119Z 2025-05-07T20:25:17.9140130Z 2025-05-07T20:25:17.9140134Z 2025-05-07T20:25:17.9140138Z 2025-05-07T20:25:17.9140327Z  2025-05-07T20:25:17.9140540Z 2025-05-07T20:25:17.9140543Z 2025-05-07T20:25:17.9140547Z 2025-05-07T20:25:17.9140550Z 2025-05-07T20:25:17.9140554Z 2025-05-07T20:25:17.9140558Z 2025-05-07T20:25:17.9140561Z 2025-05-07T20:25:17.9140565Z 2025-05-07T20:25:17.9140568Z 2025-05-07T20:25:17.9140572Z 2025-05-07T20:25:17.9140775Z  2025-05-07T20:25:17.9140996Z 2025-05-07T20:25:17.9141000Z 2025-05-07T20:25:17.9141003Z 2025-05-07T20:25:17.9141007Z 2025-05-07T20:25:17.9141010Z 2025-05-07T20:25:17.9141014Z 2025-05-07T20:25:17.9141017Z 2025-05-07T20:25:17.9141021Z 2025-05-07T20:25:17.9141025Z 2025-05-07T20:25:17.9141028Z 2025-05-07T20:25:17.9141039Z 2025-05-07T20:25:17.9141247Z  done 2025-05-07T20:25:18.0142384Z Preparing transaction: \ done 2025-05-07T20:25:18.3150243Z Verifying transaction: / - \ done 2025-05-07T20:25:18.4160139Z Executing transaction: / done 2025-05-07T20:25:18.5810239Z [INSTALL] Setting the C/C++ compiler symlinks ... 
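[NOTE] The transaction above installs the conda-forge GCC 11.4.0 toolchain into the build_binary environment. The exact package spec is presumably set inside setup_env.bash; as a rough local equivalent, a minimal sketch (the pins shown here are an assumption inferred from the packages downloaded above):

    # Sketch only: gxx_linux-64/gcc_linux-64 pull in gcc_impl/gxx_impl, binutils,
    # and the libstdc++/libgcc devel packages listed in the download log above.
    conda install -n build_binary -c conda-forge -y \
        gcc_linux-64=11.4.0 gxx_linux-64=11.4.0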
2025-05-07T20:25:18.5810239Z [INSTALL] Setting the C/C++ compiler symlinks ...
2025-05-07T20:25:22.5070964Z + ln -sf /home/ec2-user/miniconda/envs/build_binary/bin/x86_64-conda-linux-gnu-cc /home/ec2-user/miniconda/envs/build_binary/bin/cc
2025-05-07T20:25:22.5102338Z + ln -sf /home/ec2-user/miniconda/envs/build_binary/bin/x86_64-conda-linux-gnu-cc /home/ec2-user/miniconda/envs/build_binary/bin/gcc
2025-05-07T20:25:22.5132724Z + ln -sf /home/ec2-user/miniconda/envs/build_binary/bin/x86_64-conda-linux-gnu-c++ /home/ec2-user/miniconda/envs/build_binary/bin/c++
2025-05-07T20:25:22.5160123Z + ln -sf /home/ec2-user/miniconda/envs/build_binary/bin/x86_64-conda-linux-gnu-c++ /home/ec2-user/miniconda/envs/build_binary/bin/g++
2025-05-07T20:25:24.4196770Z /home/ec2-user/miniconda/envs/build_binary/bin/cc
2025-05-07T20:25:24.4824491Z [CHECK] Binary cc found in PATH
2025-05-07T20:25:26.3806701Z /home/ec2-user/miniconda/envs/build_binary/bin/gcc
2025-05-07T20:25:26.4444697Z [CHECK] Binary gcc found in PATH
2025-05-07T20:25:28.3393039Z /home/ec2-user/miniconda/envs/build_binary/bin/c++
2025-05-07T20:25:28.4027051Z [CHECK] Binary c++ found in PATH
2025-05-07T20:25:30.3048146Z /home/ec2-user/miniconda/envs/build_binary/bin/g++
2025-05-07T20:25:30.3674804Z [CHECK] Binary g++ found in PATH
2025-05-07T20:25:30.3679137Z [INFO] Printing out all preprocessor defines in the C compiler ...
2025-05-07T20:25:30.3679703Z + conda run -n build_binary cc -dM -E -
2025-05-07T20:25:32.2691302Z #define __gnu_linux__ 1
2025-05-07T20:25:32.2692869Z #define __GNUC__ 11
2025-05-07T20:25:32.2717461Z #define __LP64__ 1
2025-05-07T20:25:32.2723104Z #define __VERSION__ "11.4.0"
2025-05-07T20:25:32.2729856Z #define __ELF__ 1
2025-05-07T20:25:32.2734224Z #define __x86_64__ 1
2025-05-07T20:25:32.2742972Z #define __STDC_VERSION__ 201710L
2025-05-07T20:25:32.2752224Z #define __linux__ 1
2025-05-07T20:25:32.2761640Z #define __GNUC_PATCHLEVEL__ 0
2025-05-07T20:25:32.2773149Z #define __GNUC_MINOR__ 4
2025-05-07T20:25:32.2775705Z #define __STDC__ 1
[... remaining predefined macros elided ...]
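[NOTE] With -dM -E the compiler preprocesses an empty translation unit and prints every predefined macro, which makes one-line toolchain audits easy. A minimal sketch of that pattern (the grep selection is illustrative, not part of setup_env.bash):

    # Sketch: extract the toolchain facts that matter from the macro dump.
    conda run -n build_binary cc -dumpmachine    # target triple, e.g. x86_64-conda-linux-gnu
    conda run -n build_binary cc -dM -E - < /dev/null \
        | grep -E '__GNUC__|__GNUC_MINOR__|__STDC_VERSION__|__x86_64__'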
2025-05-07T20:25:32.3310998Z [INFO] Printing out all preprocessor defines in the C++ compiler ...
2025-05-07T20:25:32.3311453Z + conda run -n build_binary c++ -dM -E -x c++ -
2025-05-07T20:25:34.2353041Z #define __cpp_if_constexpr 201606L
2025-05-07T20:25:34.2378805Z #define __GNUC__ 11
2025-05-07T20:25:34.2379023Z #define __GXX_RTTI 1
2025-05-07T20:25:34.2386043Z #define __cplusplus 201703L
2025-05-07T20:25:34.2390180Z #define __GNUG__ 11
2025-05-07T20:25:34.2392834Z #define __GXX_ABI_VERSION 1016
2025-05-07T20:25:34.2397184Z #define __LP64__ 1
2025-05-07T20:25:34.2414877Z #define __x86_64__ 1
2025-05-07T20:25:34.2427213Z #define __EXCEPTIONS 1
2025-05-07T20:25:34.2428959Z #define __STDCPP_DEFAULT_NEW_ALIGNMENT__ 16
2025-05-07T20:25:34.2435281Z #define __linux__ 1
2025-05-07T20:25:34.2444338Z #define __GNUC_PATCHLEVEL__ 0
2025-05-07T20:25:34.2447097Z #define __STDCPP_THREADS__ 1
2025-05-07T20:25:34.2455273Z #define __cpp_deduction_guides 201703L
2025-05-07T20:25:34.2458368Z #define __GNUC_MINOR__ 4
[... remaining predefined macros elided ...]
__cpp_generic_lambdas 201304L 2025-05-07T20:25:34.2464120Z #define __SSE_MATH__ 1 2025-05-07T20:25:34.2464357Z #define __SIZEOF_LONG_LONG__ 8 2025-05-07T20:25:34.2464634Z #define __cpp_user_defined_literals 200809L 2025-05-07T20:25:34.2464931Z #define __FLT128_DECIMAL_DIG__ 36 2025-05-07T20:25:34.2465207Z #define __GCC_ATOMIC_LLONG_LOCK_FREE 2 2025-05-07T20:25:34.2465498Z #define __FLT32_HAS_QUIET_NAN__ 1 2025-05-07T20:25:34.2465757Z #define __FLT_DECIMAL_DIG__ 9 2025-05-07T20:25:34.2466048Z #define __UINT_FAST16_MAX__ 0xffffffffffffffffUL 2025-05-07T20:25:34.2466432Z #define __LDBL_NORM_MAX__ 1.18973149535723176502126385303097021e+4932L 2025-05-07T20:25:34.2466793Z #define __GCC_ATOMIC_SHORT_LOCK_FREE 2 2025-05-07T20:25:34.2467084Z #define __UINT_FAST8_TYPE__ unsigned char 2025-05-07T20:25:34.2467374Z #define _GNU_SOURCE 1 2025-05-07T20:25:34.2467621Z #define __cpp_init_captures 201304L 2025-05-07T20:25:34.2467893Z #define __ATOMIC_ACQ_REL 4 2025-05-07T20:25:34.2468138Z #define __ATOMIC_RELEASE 3 2025-05-07T20:25:34.2468292Z 2025-05-07T20:25:34.2969801Z 2025-05-07T20:25:34.2970356Z + conda run -n build_binary c++ --version 2025-05-07T20:25:34.2970780Z 2025-05-07T20:25:36.1929834Z c++ (conda-forge gcc 11.4.0-13) 11.4.0 2025-05-07T20:25:36.1930447Z Copyright (C) 2021 Free Software Foundation, Inc. 2025-05-07T20:25:36.1931127Z This is free software; see the source for copying conditions. There is NO 2025-05-07T20:25:36.1931933Z warranty; not even for MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. 2025-05-07T20:25:36.1932416Z 2025-05-07T20:25:36.1932423Z 2025-05-07T20:25:36.2569414Z 2025-05-07T20:25:36.2570653Z [INFO] Printing the default version of the C standard used by the compiler ... 2025-05-07T20:25:36.2571739Z + conda run -n build_binary cc -dM -E - < /dev/null | grep __STDC_VERSION__ 2025-05-07T20:25:36.2572350Z 2025-05-07T20:25:38.2220779Z #define __STDC_VERSION__ 201710L 2025-05-07T20:25:38.2222622Z 2025-05-07T20:25:38.2223098Z [INFO] Printing the default version of the C++ standard used by the compiler ... 2025-05-07T20:25:38.2223654Z + conda run -n build_binary c++ -dM -E -x c++ - < /dev/null | grep __cplusplus 2025-05-07T20:25:38.2223957Z 2025-05-07T20:25:40.1879810Z #define __cplusplus 201703L 2025-05-07T20:25:40.1882221Z 2025-05-07T20:25:40.1883006Z [INSTALL] Successfully installed C/C++ compilers 2025-05-07T20:25:40.1930675Z ##[group]Run . $PRELUDE; install_cuda $BUILD_ENV 12.6.3 2025-05-07T20:25:40.1931086Z . 
$PRELUDE; install_cuda $BUILD_ENV 12.6.3 2025-05-07T20:25:40.1943681Z shell: /usr/bin/bash --noprofile --norc -e -o pipefail {0} 2025-05-07T20:25:40.1944026Z env: 2025-05-07T20:25:40.1944256Z PRELUDE: .github/scripts/setup_env.bash 2025-05-07T20:25:40.1944552Z BUILD_ENV: build_binary 2025-05-07T20:25:40.1944801Z BUILD_TARGET: genai 2025-05-07T20:25:40.1945031Z BUILD_VARIANT: cuda 2025-05-07T20:25:40.1945262Z BUILD_CUDA_VERSION: 12.6.3 2025-05-07T20:25:40.1945525Z ENFORCE_CUDA_DEVICE: 1 2025-05-07T20:25:40.1945828Z GPU_FLAG: --gpus all -e NVIDIA_DRIVER_CAPABILITIES=all 2025-05-07T20:25:40.1946156Z ##[endgroup] 2025-05-07T20:25:40.5332740Z ################################################################################ 2025-05-07T20:25:40.5333105Z # Install CUDA 2025-05-07T20:25:40.5333318Z # 2025-05-07T20:25:40.5349661Z # [2025-05-07T20:25:40.534Z] + install_cuda build_binary 12.6.3 2025-05-07T20:25:40.5350048Z ################################################################################ 2025-05-07T20:25:40.5350262Z 2025-05-07T20:25:40.5366566Z [EXEC] [ATTEMPT 0/3] + wget -q --timeout 1 pypi.org -O /dev/null 2025-05-07T20:25:40.6292208Z [CHECK] Network does not appear to be blocked. 2025-05-07T20:25:40.6292595Z [SETUP] Cleaning up Conda packages ... 2025-05-07T20:25:40.6297682Z + conda clean --packages --tarball -y 2025-05-07T20:25:40.6297898Z 2025-05-07T20:25:41.3409510Z Will remove 29 (113.6 MB) tarball(s). 2025-05-07T20:25:41.3409867Z Will remove 6 (619 KB) package(s). 2025-05-07T20:25:41.4088118Z 2025-05-07T20:25:41.4096961Z + conda clean --all -y 2025-05-07T20:25:41.4097176Z 2025-05-07T20:25:42.0815092Z There are no unused tarball(s) to remove. 2025-05-07T20:25:42.0815420Z Will remove 1 index cache(s). 2025-05-07T20:25:42.0815789Z There are no unused package(s) to remove. 2025-05-07T20:25:42.0816219Z There are no tempfile(s) to remove. 2025-05-07T20:25:42.0816615Z There are no logfile(s) to remove. 2025-05-07T20:25:42.1450625Z 2025-05-07T20:25:42.1464203Z [INSTALL] Installing CUDA 12.6.3 ... 
2025-05-07T20:25:42.1489082Z [EXEC] [ATTEMPT 0/3] + conda install --force-reinstall -n build_binary -c conda-forge --override-channels -y cuda=12.6.3 2025-05-07T20:25:43.0597217Z Channels: 2025-05-07T20:25:43.0597709Z - conda-forge 2025-05-07T20:25:43.0598538Z Platform: linux-64 2025-05-07T20:25:53.6575149Z Collecting package metadata (repodata.json): - \ | / - \ | / - \ | / - \ | / - done 2025-05-07T20:25:54.7668159Z Solving environment: | / - \ | done 2025-05-07T20:25:54.8423350Z 2025-05-07T20:25:54.8423680Z ## Package Plan ## 2025-05-07T20:25:54.8423840Z 2025-05-07T20:25:54.8424189Z environment location: /home/ec2-user/miniconda/envs/build_binary 2025-05-07T20:25:54.8424532Z 2025-05-07T20:25:54.8424633Z added / updated specs: 2025-05-07T20:25:54.8424883Z - cuda=12.6.3 2025-05-07T20:25:54.8425020Z 2025-05-07T20:25:54.8425054Z 2025-05-07T20:25:54.8425186Z The following packages will be downloaded: 2025-05-07T20:25:54.8425399Z 2025-05-07T20:25:54.8425519Z package | build 2025-05-07T20:25:54.8425830Z ---------------------------|----------------- 2025-05-07T20:25:54.8426197Z alsa-lib-1.2.14 | hb9d3cd8_0 553 KB conda-forge 2025-05-07T20:25:54.8426602Z attr-2.5.1 | h166bdaf_1 69 KB conda-forge 2025-05-07T20:25:54.8427001Z binutils-2.40 | h4852527_7 31 KB conda-forge 2025-05-07T20:25:54.8427522Z c-compiler-1.5.2 | h0b41bf4_0 6 KB conda-forge 2025-05-07T20:25:54.8428112Z cuda-12.6.3 | ha804496_0 26 KB conda-forge 2025-05-07T20:25:54.8428623Z cuda-cccl_linux-64-12.6.77 | ha770c72_0 1.0 MB conda-forge 2025-05-07T20:25:54.8429300Z cuda-command-line-tools-12.6.3| ha770c72_0 20 KB conda-forge 2025-05-07T20:25:54.8430785Z cuda-compiler-12.6.3 | hbad6d8a_0 20 KB conda-forge 2025-05-07T20:25:54.8431358Z cuda-crt-dev_linux-64-12.6.85| ha770c72_0 87 KB conda-forge 2025-05-07T20:25:54.8431991Z cuda-crt-tools-12.6.85 | ha770c72_0 26 KB conda-forge 2025-05-07T20:25:54.8432606Z cuda-cudart-12.6.77 | h5888daf_0 22 KB conda-forge 2025-05-07T20:25:54.8433226Z cuda-cudart-dev-12.6.77 | h5888daf_0 22 KB conda-forge 2025-05-07T20:25:54.8433873Z cuda-cudart-dev_linux-64-12.6.77| h3f2d84a_0 357 KB conda-forge 2025-05-07T20:25:54.8434548Z cuda-cudart-static-12.6.77 | h5888daf_0 22 KB conda-forge 2025-05-07T20:25:54.8435152Z cuda-cudart-static_linux-64-12.6.77| h3f2d84a_0 744 KB conda-forge 2025-05-07T20:25:54.8435662Z cuda-cudart_linux-64-12.6.77| h3f2d84a_0 184 KB conda-forge 2025-05-07T20:25:54.8436143Z cuda-cuobjdump-12.6.77 | hbd13f7d_1 241 KB conda-forge 2025-05-07T20:25:54.8436586Z cuda-cupti-12.6.80 | hbd13f7d_0 1.9 MB conda-forge 2025-05-07T20:25:54.8437032Z cuda-cupti-dev-12.6.80 | h5888daf_0 3.4 MB conda-forge 2025-05-07T20:25:54.8437486Z cuda-cuxxfilt-12.6.77 | hbd13f7d_1 211 KB conda-forge 2025-05-07T20:25:54.8437943Z cuda-driver-dev-12.6.77 | h5888daf_0 22 KB conda-forge 2025-05-07T20:25:54.8438421Z cuda-driver-dev_linux-64-12.6.77| h3f2d84a_0 35 KB conda-forge 2025-05-07T20:25:54.8438877Z cuda-gdb-12.6.77 | h50b4baa_1 370 KB conda-forge 2025-05-07T20:25:54.8439308Z cuda-libraries-12.6.3 | ha770c72_0 20 KB conda-forge 2025-05-07T20:25:54.8439770Z cuda-libraries-dev-12.6.3 | ha770c72_0 20 KB conda-forge 2025-05-07T20:25:54.8440235Z cuda-nsight-12.6.77 | h7938cbb_0 113.2 MB conda-forge 2025-05-07T20:25:54.8440670Z cuda-nvcc-12.6.85 | hcdd1206_0 23 KB conda-forge 2025-05-07T20:25:54.8441124Z cuda-nvcc-dev_linux-64-12.6.85| he91c749_0 10.8 MB conda-forge 2025-05-07T20:25:54.8441583Z cuda-nvcc-impl-12.6.85 | h85509e4_0 25 KB conda-forge 2025-05-07T20:25:54.8442037Z cuda-nvcc-tools-12.6.85 | he02047a_0 23.0 MB 
conda-forge 2025-05-07T20:25:54.8442502Z cuda-nvcc_linux-64-12.6.85 | h04802cd_0 25 KB conda-forge 2025-05-07T20:25:54.8442950Z cuda-nvdisasm-12.6.77 | hbd13f7d_1 47.6 MB conda-forge 2025-05-07T20:25:54.8443399Z cuda-nvml-dev-12.6.77 | hbd13f7d_1 159 KB conda-forge 2025-05-07T20:25:54.8443842Z cuda-nvprof-12.6.80 | hbd13f7d_0 2.6 MB conda-forge 2025-05-07T20:25:54.8444282Z cuda-nvprune-12.6.77 | hbd13f7d_1 66 KB conda-forge 2025-05-07T20:25:54.8444719Z cuda-nvrtc-12.6.85 | hbd13f7d_0 17.3 MB conda-forge 2025-05-07T20:25:54.8445163Z cuda-nvrtc-dev-12.6.85 | h5888daf_0 31 KB conda-forge 2025-05-07T20:25:54.8445604Z cuda-nvtx-12.6.77 | hbd13f7d_0 31 KB conda-forge 2025-05-07T20:25:54.8446050Z cuda-nvvm-dev_linux-64-12.6.85| ha770c72_0 25 KB conda-forge 2025-05-07T20:25:54.8446516Z cuda-nvvm-impl-12.6.85 | he02047a_0 7.7 MB conda-forge 2025-05-07T20:25:54.8446972Z cuda-nvvm-tools-12.6.85 | he02047a_0 10.4 MB conda-forge 2025-05-07T20:25:54.8447416Z cuda-nvvp-12.6.80 | hbd13f7d_1 109.3 MB conda-forge 2025-05-07T20:25:54.8447844Z cuda-opencl-12.6.77 | hbd13f7d_0 29 KB conda-forge 2025-05-07T20:25:54.8448293Z cuda-opencl-dev-12.6.77 | h5888daf_0 93 KB conda-forge 2025-05-07T20:25:54.8448767Z cuda-profiler-api-12.6.77 | h7938cbb_0 22 KB conda-forge 2025-05-07T20:25:54.8449459Z cuda-runtime-12.6.3 | ha804496_0 19 KB conda-forge 2025-05-07T20:25:54.8449916Z cuda-sanitizer-api-12.6.77 | hbd13f7d_1 8.9 MB conda-forge 2025-05-07T20:25:54.8450379Z cuda-toolkit-12.6.3 | ha804496_0 19 KB conda-forge 2025-05-07T20:25:54.8450809Z cuda-tools-12.6.3 | ha770c72_0 19 KB conda-forge 2025-05-07T20:25:54.8451229Z cuda-version-12.6 | h7480c83_3 20 KB conda-forge 2025-05-07T20:25:54.8451683Z cuda-visual-tools-12.6.3 | ha770c72_0 19 KB conda-forge 2025-05-07T20:25:54.8452136Z cxx-compiler-1.5.2 | hf52228f_0 6 KB conda-forge 2025-05-07T20:25:54.8452543Z dbus-1.13.6 | h5008d03_3 604 KB conda-forge 2025-05-07T20:25:54.8452921Z expat-2.7.0 | h5888daf_0 137 KB conda-forge 2025-05-07T20:25:54.8453385Z font-ttf-dejavu-sans-mono-2.37| hab24e00_0 388 KB conda-forge 2025-05-07T20:25:54.8454062Z font-ttf-inconsolata-3.000 | h77eed37_0 94 KB conda-forge 2025-05-07T20:25:54.8454563Z font-ttf-source-code-pro-2.038| h77eed37_0 684 KB conda-forge 2025-05-07T20:25:54.8455053Z font-ttf-ubuntu-0.83 | h77eed37_3 1.5 MB conda-forge 2025-05-07T20:25:54.8455506Z fontconfig-2.15.0 | h7e30c49_1 259 KB conda-forge 2025-05-07T20:25:54.8455955Z fonts-conda-ecosystem-1 | 0 4 KB conda-forge 2025-05-07T20:25:54.8456421Z fonts-conda-forge-1 | 0 4 KB conda-forge 2025-05-07T20:25:54.8456851Z freetype-2.13.3 | ha770c72_1 168 KB conda-forge 2025-05-07T20:25:54.8457245Z gcc-11.4.0 | h602e360_13 49 KB conda-forge 2025-05-07T20:25:54.8457636Z gds-tools-1.11.1.6 | h5888daf_4 37.8 MB conda-forge 2025-05-07T20:25:54.8458043Z gmp-6.3.0 | hac33072_2 449 KB conda-forge 2025-05-07T20:25:54.8458411Z gxx-11.4.0 | h602e360_13 49 KB conda-forge 2025-05-07T20:25:54.8458789Z keyutils-1.6.1 | h166bdaf_0 115 KB conda-forge 2025-05-07T20:25:54.8459180Z krb5-1.21.3 | h659f571_0 1.3 MB conda-forge 2025-05-07T20:25:54.8459565Z libcap-2.71 | h39aace5_0 100 KB conda-forge 2025-05-07T20:25:54.8459973Z libcublas-12.6.4.1 | h5888daf_1 256.2 MB conda-forge 2025-05-07T20:25:54.8460410Z libcublas-dev-12.6.4.1 | h5888daf_1 88 KB conda-forge 2025-05-07T20:25:54.8460846Z libcufft-11.3.0.4 | hbd13f7d_0 156.2 MB conda-forge 2025-05-07T20:25:54.8461283Z libcufft-dev-11.3.0.4 | h5888daf_0 33 KB conda-forge 2025-05-07T20:25:54.8461720Z libcufile-1.11.1.6 | h12f29b5_4 900 KB conda-forge 
2025-05-07T20:25:54.8462166Z libcufile-dev-1.11.1.6 | h5888daf_4 35 KB conda-forge 2025-05-07T20:25:54.8462607Z libcurand-10.3.7.77 | hbd13f7d_0 39.9 MB conda-forge 2025-05-07T20:25:54.8463052Z libcurand-dev-10.3.7.77 | h5888daf_0 262 KB conda-forge 2025-05-07T20:25:54.8463494Z libcusolver-11.7.1.2 | h5888daf_1 95.8 MB conda-forge 2025-05-07T20:25:54.8463947Z libcusolver-dev-11.7.1.2 | h5888daf_1 59 KB conda-forge 2025-05-07T20:25:54.8464399Z libcusparse-12.5.4.2 | hbd13f7d_0 118.6 MB conda-forge 2025-05-07T20:25:54.8464842Z libcusparse-dev-12.5.4.2 | h5888daf_0 51 KB conda-forge 2025-05-07T20:25:54.8465309Z libedit-3.1.20250104 | pl5321h7949ede_0 132 KB conda-forge 2025-05-07T20:25:54.8465745Z libexpat-2.7.0 | h5888daf_0 73 KB conda-forge 2025-05-07T20:25:54.8466275Z libfreetype-2.13.3 | ha770c72_1 8 KB conda-forge 2025-05-07T20:25:54.8466787Z libfreetype6-2.13.3 | h48d6fc4_1 371 KB conda-forge 2025-05-07T20:25:54.8467234Z libgcrypt-lib-1.11.0 | hb9d3cd8_2 572 KB conda-forge 2025-05-07T20:25:54.8467662Z libglib-2.84.0 | h2ff4ddf_0 3.8 MB conda-forge 2025-05-07T20:25:54.8468083Z libgpg-error-1.55 | h3f2d84a_0 305 KB conda-forge 2025-05-07T20:25:54.8468498Z libiconv-1.18 | h4ce23a2_1 696 KB conda-forge 2025-05-07T20:25:54.8468893Z libnl-3.11.0 | hb9d3cd8_0 724 KB conda-forge 2025-05-07T20:25:54.8469288Z libnpp-12.3.1.54 | h5888daf_0 93.4 MB conda-forge 2025-05-07T20:25:54.8469704Z libnpp-dev-12.3.1.54 | h5888daf_0 441 KB conda-forge 2025-05-07T20:25:54.8470128Z libnuma-2.0.18 | h4ab18f5_2 42 KB conda-forge 2025-05-07T20:25:54.8470557Z libnvfatbin-12.6.77 | hbd13f7d_0 783 KB conda-forge 2025-05-07T20:25:54.8471006Z libnvfatbin-dev-12.6.77 | h5888daf_0 26 KB conda-forge 2025-05-07T20:25:54.8471459Z libnvjitlink-12.6.85 | hbd13f7d_0 14.9 MB conda-forge 2025-05-07T20:25:54.8471924Z libnvjitlink-dev-12.6.85 | h5888daf_0 25 KB conda-forge 2025-05-07T20:25:54.8472377Z libnvjpeg-12.3.3.54 | h5888daf_0 2.4 MB conda-forge 2025-05-07T20:25:54.8472815Z libnvjpeg-dev-12.3.3.54 | ha770c72_0 31 KB conda-forge 2025-05-07T20:25:54.8473241Z libpng-1.6.47 | h943b412_0 282 KB conda-forge 2025-05-07T20:25:54.8473655Z libsqlite-3.49.2 | hee588c1_0 895 KB conda-forge 2025-05-07T20:25:54.8474085Z libsystemd0-256.9 | h2774228_0 401 KB conda-forge 2025-05-07T20:25:54.8474507Z libudev1-257.4 | h9a4d06a_0 140 KB conda-forge 2025-05-07T20:25:54.8474924Z libuuid-2.38.1 | h0b41bf4_0 33 KB conda-forge 2025-05-07T20:25:54.8475323Z libxcb-1.17.0 | h8a09558_0 387 KB conda-forge 2025-05-07T20:25:54.8475737Z libxkbcommon-1.8.0 | hc4a0caf_0 627 KB conda-forge 2025-05-07T20:25:54.8476173Z libxkbfile-1.1.0 | h166bdaf_1 111 KB conda-forge 2025-05-07T20:25:54.8476590Z libxml2-2.13.5 | h064dc61_0 673 KB conda-forge 2025-05-07T20:25:54.8476990Z libzlib-1.3.1 | hb9d3cd8_2 60 KB conda-forge 2025-05-07T20:25:54.8477375Z lz4-c-1.9.4 | hcb278e6_0 140 KB conda-forge 2025-05-07T20:25:54.8477765Z ncurses-6.5 | h2d0b736_3 871 KB conda-forge 2025-05-07T20:25:54.8478212Z nsight-compute-2024.3.2.3 | hb5ebaad_0 443.1 MB conda-forge 2025-05-07T20:25:54.8478650Z nspr-4.36 | h5888daf_0 225 KB conda-forge 2025-05-07T20:25:54.8479023Z nss-3.111 | h159eef7_0 1.9 MB conda-forge 2025-05-07T20:25:54.8479417Z ocl-icd-2.3.3 | hb9d3cd8_0 104 KB conda-forge 2025-05-07T20:25:54.8479857Z opencl-headers-2024.10.24 | h5888daf_0 53 KB conda-forge 2025-05-07T20:25:54.8480280Z pcre2-10.44 | hc749103_2 934 KB conda-forge 2025-05-07T20:25:54.8480701Z pthread-stubs-0.4 | hb9d3cd8_1002 8 KB conda-forge 2025-05-07T20:25:54.8481142Z python-3.13.0 |h9ebbce0_101_cp313 31.5 
MB conda-forge 2025-05-07T20:25:54.8481559Z rdma-core-55.0 | h5888daf_0 1.2 MB conda-forge 2025-05-07T20:25:54.8481962Z sqlite-3.49.2 | h9eae976_0 840 KB conda-forge 2025-05-07T20:25:54.8482450Z tk-8.6.13 |noxft_h4845f30_101 3.2 MB conda-forge 2025-05-07T20:25:54.8482969Z wayland-1.23.1 | h3e06ad9_0 314 KB conda-forge 2025-05-07T20:25:54.8483367Z xcb-util-0.4.1 | hb711507_2 19 KB conda-forge 2025-05-07T20:25:54.8483847Z xcb-util-cursor-0.1.5 | hb9d3cd8_0 20 KB conda-forge 2025-05-07T20:25:54.8484298Z xcb-util-image-0.4.0 | hb711507_2 24 KB conda-forge 2025-05-07T20:25:54.8484743Z xcb-util-keysyms-0.4.1 | hb711507_0 14 KB conda-forge 2025-05-07T20:25:54.8485211Z xcb-util-renderutil-0.3.10 | hb711507_0 17 KB conda-forge 2025-05-07T20:25:54.8485663Z xcb-util-wm-0.4.2 | hb711507_0 50 KB conda-forge 2025-05-07T20:25:54.8486105Z xkeyboard-config-2.44 | hb9d3cd8_0 384 KB conda-forge 2025-05-07T20:25:54.8486546Z xorg-libice-1.1.2 | hb9d3cd8_0 57 KB conda-forge 2025-05-07T20:25:54.8486989Z xorg-libsm-1.2.6 | he73a12e_0 27 KB conda-forge 2025-05-07T20:25:54.8487413Z xorg-libx11-1.8.12 | h4f16b4b_0 816 KB conda-forge 2025-05-07T20:25:54.8487837Z xorg-libxau-1.0.12 | hb9d3cd8_0 14 KB conda-forge 2025-05-07T20:25:54.8488289Z xorg-libxcomposite-0.4.6 | hb9d3cd8_2 13 KB conda-forge 2025-05-07T20:25:54.8488757Z xorg-libxdamage-1.1.6 | hb9d3cd8_0 13 KB conda-forge 2025-05-07T20:25:54.8489207Z xorg-libxdmcp-1.1.5 | hb9d3cd8_0 19 KB conda-forge 2025-05-07T20:25:54.8489642Z xorg-libxext-1.3.6 | hb9d3cd8_0 49 KB conda-forge 2025-05-07T20:25:54.8490074Z xorg-libxfixes-6.0.1 | hb9d3cd8_0 19 KB conda-forge 2025-05-07T20:25:54.8490502Z xorg-libxi-1.8.2 | hb9d3cd8_0 46 KB conda-forge 2025-05-07T20:25:54.8490941Z xorg-libxrandr-1.5.4 | hb9d3cd8_0 29 KB conda-forge 2025-05-07T20:25:54.8491391Z xorg-libxrender-0.9.12 | hb9d3cd8_0 32 KB conda-forge 2025-05-07T20:25:54.8491837Z xorg-libxtst-1.2.5 | hb9d3cd8_3 32 KB conda-forge 2025-05-07T20:25:54.8492243Z zlib-1.3.1 | hb9d3cd8_2 90 KB conda-forge 2025-05-07T20:25:54.8492619Z zstd-1.5.7 | hb8e6e7a_2 554 KB conda-forge 2025-05-07T20:25:54.8492985Z ------------------------------------------------------------ 2025-05-07T20:25:54.8493321Z Total: 1.64 GB 2025-05-07T20:25:54.8493528Z 2025-05-07T20:25:54.8493782Z The following NEW packages will be INSTALLED: 2025-05-07T20:25:54.8494004Z 2025-05-07T20:25:54.8494204Z alsa-lib conda-forge/linux-64::alsa-lib-1.2.14-hb9d3cd8_0 2025-05-07T20:25:54.8494623Z attr conda-forge/linux-64::attr-2.5.1-h166bdaf_1 2025-05-07T20:25:54.8495038Z binutils conda-forge/linux-64::binutils-2.40-h4852527_7 2025-05-07T20:25:54.8495498Z c-compiler conda-forge/linux-64::c-compiler-1.5.2-h0b41bf4_0 2025-05-07T20:25:54.8495919Z cuda conda-forge/noarch::cuda-12.6.3-ha804496_0 2025-05-07T20:25:54.8496390Z cuda-cccl_linux-64 conda-forge/noarch::cuda-cccl_linux-64-12.6.77-ha770c72_0 2025-05-07T20:25:54.8496975Z cuda-command-line~ conda-forge/linux-64::cuda-command-line-tools-12.6.3-ha770c72_0 2025-05-07T20:25:54.8497549Z cuda-compiler conda-forge/noarch::cuda-compiler-12.6.3-hbad6d8a_0 2025-05-07T20:25:54.8498087Z cuda-crt-dev_linu~ conda-forge/noarch::cuda-crt-dev_linux-64-12.6.85-ha770c72_0 2025-05-07T20:25:54.8499077Z cuda-crt-tools conda-forge/linux-64::cuda-crt-tools-12.6.85-ha770c72_0 2025-05-07T20:25:54.8499593Z cuda-cudart conda-forge/linux-64::cuda-cudart-12.6.77-h5888daf_0 2025-05-07T20:25:54.8500111Z cuda-cudart-dev conda-forge/linux-64::cuda-cudart-dev-12.6.77-h5888daf_0 2025-05-07T20:25:54.8501140Z cuda-cudart-dev_l~ 
conda-forge/noarch::cuda-cudart-dev_linux-64-12.6.77-h3f2d84a_0 2025-05-07T20:25:54.8501739Z cuda-cudart-static conda-forge/linux-64::cuda-cudart-static-12.6.77-h5888daf_0 2025-05-07T20:25:54.8502385Z cuda-cudart-stati~ conda-forge/noarch::cuda-cudart-static_linux-64-12.6.77-h3f2d84a_0 2025-05-07T20:25:54.8503018Z cuda-cudart_linux~ conda-forge/noarch::cuda-cudart_linux-64-12.6.77-h3f2d84a_0 2025-05-07T20:25:54.8503581Z cuda-cuobjdump conda-forge/linux-64::cuda-cuobjdump-12.6.77-hbd13f7d_1 2025-05-07T20:25:54.8504088Z cuda-cupti conda-forge/linux-64::cuda-cupti-12.6.80-hbd13f7d_0 2025-05-07T20:25:54.8504582Z cuda-cupti-dev conda-forge/linux-64::cuda-cupti-dev-12.6.80-h5888daf_0 2025-05-07T20:25:54.8505106Z cuda-cuxxfilt conda-forge/linux-64::cuda-cuxxfilt-12.6.77-hbd13f7d_1 2025-05-07T20:25:54.8505644Z cuda-driver-dev conda-forge/linux-64::cuda-driver-dev-12.6.77-h5888daf_0 2025-05-07T20:25:54.8506215Z cuda-driver-dev_l~ conda-forge/noarch::cuda-driver-dev_linux-64-12.6.77-h3f2d84a_0 2025-05-07T20:25:54.8506734Z cuda-gdb conda-forge/linux-64::cuda-gdb-12.6.77-h50b4baa_1 2025-05-07T20:25:54.8507217Z cuda-libraries conda-forge/linux-64::cuda-libraries-12.6.3-ha770c72_0 2025-05-07T20:25:54.8507766Z cuda-libraries-dev conda-forge/linux-64::cuda-libraries-dev-12.6.3-ha770c72_0 2025-05-07T20:25:54.8508299Z cuda-nsight conda-forge/linux-64::cuda-nsight-12.6.77-h7938cbb_0 2025-05-07T20:25:54.8508765Z cuda-nvcc conda-forge/linux-64::cuda-nvcc-12.6.85-hcdd1206_0 2025-05-07T20:25:54.8509281Z cuda-nvcc-dev_lin~ conda-forge/noarch::cuda-nvcc-dev_linux-64-12.6.85-he91c749_0 2025-05-07T20:25:54.8509828Z cuda-nvcc-impl conda-forge/linux-64::cuda-nvcc-impl-12.6.85-h85509e4_0 2025-05-07T20:25:54.8510357Z cuda-nvcc-tools conda-forge/linux-64::cuda-nvcc-tools-12.6.85-he02047a_0 2025-05-07T20:25:54.8510894Z cuda-nvcc_linux-64 conda-forge/linux-64::cuda-nvcc_linux-64-12.6.85-h04802cd_0 2025-05-07T20:25:54.8511432Z cuda-nvdisasm conda-forge/linux-64::cuda-nvdisasm-12.6.77-hbd13f7d_1 2025-05-07T20:25:54.8511945Z cuda-nvml-dev conda-forge/linux-64::cuda-nvml-dev-12.6.77-hbd13f7d_1 2025-05-07T20:25:54.8512437Z cuda-nvprof conda-forge/linux-64::cuda-nvprof-12.6.80-hbd13f7d_0 2025-05-07T20:25:54.8512933Z cuda-nvprune conda-forge/linux-64::cuda-nvprune-12.6.77-hbd13f7d_1 2025-05-07T20:25:54.8513424Z cuda-nvrtc conda-forge/linux-64::cuda-nvrtc-12.6.85-hbd13f7d_0 2025-05-07T20:25:54.8513926Z cuda-nvrtc-dev conda-forge/linux-64::cuda-nvrtc-dev-12.6.85-h5888daf_0 2025-05-07T20:25:54.8514406Z cuda-nvtx conda-forge/linux-64::cuda-nvtx-12.6.77-hbd13f7d_0 2025-05-07T20:25:54.8514913Z cuda-nvvm-dev_lin~ conda-forge/noarch::cuda-nvvm-dev_linux-64-12.6.85-ha770c72_0 2025-05-07T20:25:54.8515461Z cuda-nvvm-impl conda-forge/linux-64::cuda-nvvm-impl-12.6.85-he02047a_0 2025-05-07T20:25:54.8515997Z cuda-nvvm-tools conda-forge/linux-64::cuda-nvvm-tools-12.6.85-he02047a_0 2025-05-07T20:25:54.8516550Z cuda-nvvp conda-forge/linux-64::cuda-nvvp-12.6.80-hbd13f7d_1 2025-05-07T20:25:54.8517222Z cuda-opencl conda-forge/linux-64::cuda-opencl-12.6.77-hbd13f7d_0 2025-05-07T20:25:54.8517921Z cuda-opencl-dev conda-forge/linux-64::cuda-opencl-dev-12.6.77-h5888daf_0 2025-05-07T20:25:54.8518534Z cuda-profiler-api conda-forge/linux-64::cuda-profiler-api-12.6.77-h7938cbb_0 2025-05-07T20:25:54.8519057Z cuda-runtime conda-forge/noarch::cuda-runtime-12.6.3-ha804496_0 2025-05-07T20:25:54.8519595Z cuda-sanitizer-api conda-forge/linux-64::cuda-sanitizer-api-12.6.77-hbd13f7d_1 2025-05-07T20:25:54.8520137Z cuda-toolkit 
conda-forge/noarch::cuda-toolkit-12.6.3-ha804496_0 2025-05-07T20:25:54.8520628Z cuda-tools conda-forge/linux-64::cuda-tools-12.6.3-ha770c72_0 2025-05-07T20:25:54.8521272Z cuda-version conda-forge/noarch::cuda-version-12.6-h7480c83_3 2025-05-07T20:25:54.8522114Z cuda-visual-tools conda-forge/linux-64::cuda-visual-tools-12.6.3-ha770c72_0 2025-05-07T20:25:54.8522740Z cxx-compiler conda-forge/linux-64::cxx-compiler-1.5.2-hf52228f_0 2025-05-07T20:25:54.8523185Z dbus conda-forge/linux-64::dbus-1.13.6-h5008d03_3 2025-05-07T20:25:54.8523679Z font-ttf-dejavu-s~ conda-forge/noarch::font-ttf-dejavu-sans-mono-2.37-hab24e00_0 2025-05-07T20:25:54.8524271Z font-ttf-inconsol~ conda-forge/noarch::font-ttf-inconsolata-3.000-h77eed37_0 2025-05-07T20:25:54.8524864Z font-ttf-source-c~ conda-forge/noarch::font-ttf-source-code-pro-2.038-h77eed37_0 2025-05-07T20:25:54.8525417Z font-ttf-ubuntu conda-forge/noarch::font-ttf-ubuntu-0.83-h77eed37_3 2025-05-07T20:25:54.8525909Z fontconfig conda-forge/linux-64::fontconfig-2.15.0-h7e30c49_1 2025-05-07T20:25:54.8526398Z fonts-conda-ecosy~ conda-forge/noarch::fonts-conda-ecosystem-1-0 2025-05-07T20:25:54.8526879Z fonts-conda-forge conda-forge/noarch::fonts-conda-forge-1-0 2025-05-07T20:25:54.8527333Z freetype conda-forge/linux-64::freetype-2.13.3-ha770c72_1 2025-05-07T20:25:54.8527756Z gcc conda-forge/linux-64::gcc-11.4.0-h602e360_13 2025-05-07T20:25:54.8528175Z gds-tools conda-forge/linux-64::gds-tools-1.11.1.6-h5888daf_4 2025-05-07T20:25:54.8528647Z gmp conda-forge/linux-64::gmp-6.3.0-hac33072_2 2025-05-07T20:25:54.8529169Z gxx conda-forge/linux-64::gxx-11.4.0-h602e360_13 2025-05-07T20:25:54.8529730Z keyutils conda-forge/linux-64::keyutils-1.6.1-h166bdaf_0 2025-05-07T20:25:54.8530209Z krb5 conda-forge/linux-64::krb5-1.21.3-h659f571_0 2025-05-07T20:25:54.8530608Z libcap conda-forge/linux-64::libcap-2.71-h39aace5_0 2025-05-07T20:25:54.8531040Z libcublas conda-forge/linux-64::libcublas-12.6.4.1-h5888daf_1 2025-05-07T20:25:54.8531548Z libcublas-dev conda-forge/linux-64::libcublas-dev-12.6.4.1-h5888daf_1 2025-05-07T20:25:54.8532047Z libcufft conda-forge/linux-64::libcufft-11.3.0.4-hbd13f7d_0 2025-05-07T20:25:54.8532522Z libcufft-dev conda-forge/linux-64::libcufft-dev-11.3.0.4-h5888daf_0 2025-05-07T20:25:54.8533006Z libcufile conda-forge/linux-64::libcufile-1.11.1.6-h12f29b5_4 2025-05-07T20:25:54.8533499Z libcufile-dev conda-forge/linux-64::libcufile-dev-1.11.1.6-h5888daf_4 2025-05-07T20:25:54.8534129Z libcurand conda-forge/linux-64::libcurand-10.3.7.77-hbd13f7d_0 2025-05-07T20:25:54.8534623Z libcurand-dev conda-forge/linux-64::libcurand-dev-10.3.7.77-h5888daf_0 2025-05-07T20:25:54.8535131Z libcusolver conda-forge/linux-64::libcusolver-11.7.1.2-h5888daf_1 2025-05-07T20:25:54.8535655Z libcusolver-dev conda-forge/linux-64::libcusolver-dev-11.7.1.2-h5888daf_1 2025-05-07T20:25:54.8536180Z libcusparse conda-forge/linux-64::libcusparse-12.5.4.2-hbd13f7d_0 2025-05-07T20:25:54.8536699Z libcusparse-dev conda-forge/linux-64::libcusparse-dev-12.5.4.2-h5888daf_0 2025-05-07T20:25:54.8537230Z libedit conda-forge/linux-64::libedit-3.1.20250104-pl5321h7949ede_0 2025-05-07T20:25:54.8537710Z libexpat conda-forge/linux-64::libexpat-2.7.0-h5888daf_0 2025-05-07T20:25:54.8538171Z libfreetype conda-forge/linux-64::libfreetype-2.13.3-ha770c72_1 2025-05-07T20:25:54.8538656Z libfreetype6 conda-forge/linux-64::libfreetype6-2.13.3-h48d6fc4_1 2025-05-07T20:25:54.8539163Z libgcrypt-lib conda-forge/linux-64::libgcrypt-lib-1.11.0-hb9d3cd8_2 2025-05-07T20:25:54.8539638Z libglib 
conda-forge/linux-64::libglib-2.84.0-h2ff4ddf_0 2025-05-07T20:25:54.8540095Z libgpg-error conda-forge/linux-64::libgpg-error-1.55-h3f2d84a_0 2025-05-07T20:25:54.8540553Z libiconv conda-forge/linux-64::libiconv-1.18-h4ce23a2_1 2025-05-07T20:25:54.8540976Z libnl conda-forge/linux-64::libnl-3.11.0-hb9d3cd8_0 2025-05-07T20:25:54.8541395Z libnpp conda-forge/linux-64::libnpp-12.3.1.54-h5888daf_0 2025-05-07T20:25:54.8541967Z libnpp-dev conda-forge/linux-64::libnpp-dev-12.3.1.54-h5888daf_0 2025-05-07T20:25:54.8542502Z libnuma conda-forge/linux-64::libnuma-2.0.18-h4ab18f5_2 2025-05-07T20:25:54.8542963Z libnvfatbin conda-forge/linux-64::libnvfatbin-12.6.77-hbd13f7d_0 2025-05-07T20:25:54.8543483Z libnvfatbin-dev conda-forge/linux-64::libnvfatbin-dev-12.6.77-h5888daf_0 2025-05-07T20:25:54.8544002Z libnvjitlink conda-forge/linux-64::libnvjitlink-12.6.85-hbd13f7d_0 2025-05-07T20:25:54.8544544Z libnvjitlink-dev conda-forge/linux-64::libnvjitlink-dev-12.6.85-h5888daf_0 2025-05-07T20:25:54.8545060Z libnvjpeg conda-forge/linux-64::libnvjpeg-12.3.3.54-h5888daf_0 2025-05-07T20:25:54.8545575Z libnvjpeg-dev conda-forge/linux-64::libnvjpeg-dev-12.3.3.54-ha770c72_0 2025-05-07T20:25:54.8546048Z libpng conda-forge/linux-64::libpng-1.6.47-h943b412_0 2025-05-07T20:25:54.8546487Z libsqlite conda-forge/linux-64::libsqlite-3.49.2-hee588c1_0 2025-05-07T20:25:54.8546960Z libsystemd0 conda-forge/linux-64::libsystemd0-256.9-h2774228_0 2025-05-07T20:25:54.8547424Z libudev1 conda-forge/linux-64::libudev1-257.4-h9a4d06a_0 2025-05-07T20:25:54.8547841Z libxcb conda-forge/linux-64::libxcb-1.17.0-h8a09558_0 2025-05-07T20:25:54.8548296Z libxkbcommon conda-forge/linux-64::libxkbcommon-1.8.0-hc4a0caf_0 2025-05-07T20:25:54.8548778Z libxkbfile conda-forge/linux-64::libxkbfile-1.1.0-h166bdaf_1 2025-05-07T20:25:54.8549228Z libxml2 conda-forge/linux-64::libxml2-2.13.5-h064dc61_0 2025-05-07T20:25:54.8549646Z libzlib conda-forge/linux-64::libzlib-1.3.1-hb9d3cd8_2 2025-05-07T20:25:54.8558945Z lz4-c conda-forge/linux-64::lz4-c-1.9.4-hcb278e6_0 2025-05-07T20:25:54.8559681Z nsight-compute conda-forge/linux-64::nsight-compute-2024.3.2.3-hb5ebaad_0 2025-05-07T20:25:54.8560342Z nspr conda-forge/linux-64::nspr-4.36-h5888daf_0 2025-05-07T20:25:54.8560732Z nss conda-forge/linux-64::nss-3.111-h159eef7_0 2025-05-07T20:25:54.8561148Z ocl-icd conda-forge/linux-64::ocl-icd-2.3.3-hb9d3cd8_0 2025-05-07T20:25:54.8561650Z opencl-headers conda-forge/linux-64::opencl-headers-2024.10.24-h5888daf_0 2025-05-07T20:25:54.8562140Z pcre2 conda-forge/linux-64::pcre2-10.44-hc749103_2 2025-05-07T20:25:54.8562598Z pthread-stubs conda-forge/linux-64::pthread-stubs-0.4-hb9d3cd8_1002 2025-05-07T20:25:54.8563139Z rdma-core conda-forge/linux-64::rdma-core-55.0-h5888daf_0 2025-05-07T20:25:54.8563581Z wayland conda-forge/linux-64::wayland-1.23.1-h3e06ad9_0 2025-05-07T20:25:54.8564018Z xcb-util conda-forge/linux-64::xcb-util-0.4.1-hb711507_2 2025-05-07T20:25:54.8564495Z xcb-util-cursor conda-forge/linux-64::xcb-util-cursor-0.1.5-hb9d3cd8_0 2025-05-07T20:25:54.8565024Z xcb-util-image conda-forge/linux-64::xcb-util-image-0.4.0-hb711507_2 2025-05-07T20:25:54.8565563Z xcb-util-keysyms conda-forge/linux-64::xcb-util-keysyms-0.4.1-hb711507_0 2025-05-07T20:25:54.8566142Z xcb-util-renderut~ conda-forge/linux-64::xcb-util-renderutil-0.3.10-hb711507_0 2025-05-07T20:25:54.8566669Z xcb-util-wm conda-forge/linux-64::xcb-util-wm-0.4.2-hb711507_0 2025-05-07T20:25:54.8567179Z xkeyboard-config conda-forge/linux-64::xkeyboard-config-2.44-hb9d3cd8_0 2025-05-07T20:25:54.8567698Z xorg-libice 
conda-forge/linux-64::xorg-libice-1.1.2-hb9d3cd8_0 2025-05-07T20:25:54.8568177Z xorg-libsm conda-forge/linux-64::xorg-libsm-1.2.6-he73a12e_0 2025-05-07T20:25:54.8568636Z xorg-libx11 conda-forge/linux-64::xorg-libx11-1.8.12-h4f16b4b_0 2025-05-07T20:25:54.8569116Z xorg-libxau conda-forge/linux-64::xorg-libxau-1.0.12-hb9d3cd8_0 2025-05-07T20:25:54.8569656Z xorg-libxcomposite conda-forge/linux-64::xorg-libxcomposite-0.4.6-hb9d3cd8_2 2025-05-07T20:25:54.8570227Z xorg-libxdamage conda-forge/linux-64::xorg-libxdamage-1.1.6-hb9d3cd8_0 2025-05-07T20:25:54.8570920Z xorg-libxdmcp conda-forge/linux-64::xorg-libxdmcp-1.1.5-hb9d3cd8_0 2025-05-07T20:25:54.8571515Z xorg-libxext conda-forge/linux-64::xorg-libxext-1.3.6-hb9d3cd8_0 2025-05-07T20:25:54.8572030Z xorg-libxfixes conda-forge/linux-64::xorg-libxfixes-6.0.1-hb9d3cd8_0 2025-05-07T20:25:54.8572521Z xorg-libxi conda-forge/linux-64::xorg-libxi-1.8.2-hb9d3cd8_0 2025-05-07T20:25:54.8573015Z xorg-libxrandr conda-forge/linux-64::xorg-libxrandr-1.5.4-hb9d3cd8_0 2025-05-07T20:25:54.8573554Z xorg-libxrender conda-forge/linux-64::xorg-libxrender-0.9.12-hb9d3cd8_0 2025-05-07T20:25:54.8574213Z xorg-libxtst conda-forge/linux-64::xorg-libxtst-1.2.5-hb9d3cd8_3 2025-05-07T20:25:54.8574918Z zstd conda-forge/linux-64::zstd-1.5.7-hb8e6e7a_2 2025-05-07T20:25:54.8575170Z 2025-05-07T20:25:54.8575285Z The following packages will be UPDATED: 2025-05-07T20:25:54.8575490Z 2025-05-07T20:25:54.8575773Z libuuid pkgs/main::libuuid-1.41.5-h5eee18b_0 --> conda-forge::libuuid-2.38.1-h0b41bf4_0 2025-05-07T20:25:54.8576393Z ncurses pkgs/main::ncurses-6.4-h6a678d5_0 --> conda-forge::ncurses-6.5-h2d0b736_3 2025-05-07T20:25:54.8576986Z sqlite pkgs/main::sqlite-3.45.3-h5eee18b_0 --> conda-forge::sqlite-3.49.2-h9eae976_0 2025-05-07T20:25:54.8577559Z zlib pkgs/main::zlib-1.2.13-h5eee18b_1 --> conda-forge::zlib-1.3.1-hb9d3cd8_2 2025-05-07T20:25:54.8577887Z 2025-05-07T20:25:54.8578102Z The following packages will be SUPERSEDED by a higher-priority channel: 2025-05-07T20:25:54.8578408Z 2025-05-07T20:25:54.8578648Z expat pkgs/main::expat-2.7.1-h6a678d5_0 --> conda-forge::expat-2.7.0-h5888daf_0 2025-05-07T20:25:54.8579326Z python pkgs/main::python-3.13.2-hf623796_100~ --> conda-forge::python-3.13.0-h9ebbce0_101_cp313 2025-05-07T20:25:54.8580142Z tk pkgs/main::tk-8.6.14-h39e8969_0 --> conda-forge::tk-8.6.13-noxft_h4845f30_101 2025-05-07T20:25:54.8580521Z 2025-05-07T20:25:54.8580555Z 2025-05-07T20:25:54.8580559Z 2025-05-07T20:25:54.8580706Z Downloading and Extracting Packages: ...working... 2025-05-07T20:25:54.8581089Z nsight-compute-2024. 
| 443.1 MB | | 0%
[... download and extraction progress bars for the packages above elided ...]
| 443.1 MB | ###8 | 39% 2025-05-07T20:25:59.4840401Z 2025-05-07T20:25:59.5386334Z libcublas-12.6.4.1 | 256.2 MB | ######3 | 63%  2025-05-07T20:25:59.5876045Z nsight-compute-2024. | 443.1 MB | ###9 | 40% 2025-05-07T20:25:59.5876431Z 2025-05-07T20:25:59.6392875Z libcublas-12.6.4.1 | 256.2 MB | ######5 | 65%  2025-05-07T20:25:59.6878019Z nsight-compute-2024. | 443.1 MB | #### | 41% 2025-05-07T20:25:59.6882967Z 2025-05-07T20:25:59.7396020Z libcublas-12.6.4.1 | 256.2 MB | ######6 | 67%  2025-05-07T20:25:59.7880465Z nsight-compute-2024. | 443.1 MB | ####1 | 42% 2025-05-07T20:25:59.7883026Z 2025-05-07T20:25:59.8398876Z libcublas-12.6.4.1 | 256.2 MB | ######8 | 69%  2025-05-07T20:25:59.8881861Z nsight-compute-2024. | 443.1 MB | ####2 | 43% 2025-05-07T20:25:59.8884531Z 2025-05-07T20:25:59.9416055Z libcublas-12.6.4.1 | 256.2 MB | ####### | 71%  2025-05-07T20:25:59.9882425Z nsight-compute-2024. | 443.1 MB | ####3 | 44% 2025-05-07T20:25:59.9884040Z 2025-05-07T20:26:00.0448591Z libcublas-12.6.4.1 | 256.2 MB | #######3 | 73%  2025-05-07T20:26:00.0448856Z 2025-05-07T20:26:00.0448860Z 2025-05-07T20:26:00.0448874Z 2025-05-07T20:26:00.0448878Z 2025-05-07T20:26:00.0896739Z cuda-nsight-12.6.77 | 113.2 MB | ########## | 100%  2025-05-07T20:26:00.0897101Z 2025-05-07T20:26:00.1018290Z libcublas-12.6.4.1 | 256.2 MB | #######5 | 75%  2025-05-07T20:26:00.1018732Z 2025-05-07T20:26:00.1018736Z 2025-05-07T20:26:00.1018740Z 2025-05-07T20:26:00.1018743Z 2025-05-07T20:26:00.1019240Z 2025-05-07T20:26:00.1263893Z cuda-nvvp-12.6.80 | 109.3 MB | | 0%  2025-05-07T20:26:00.2020478Z nsight-compute-2024. | 443.1 MB | ####5 | 45% 2025-05-07T20:26:00.2020755Z 2025-05-07T20:26:00.2020759Z 2025-05-07T20:26:00.2020763Z 2025-05-07T20:26:00.2020766Z 2025-05-07T20:26:00.2021960Z 2025-05-07T20:26:00.2245232Z cuda-nvvp-12.6.80 | 109.3 MB | 3 | 3%  2025-05-07T20:26:00.2247475Z 2025-05-07T20:26:00.2330781Z libcublas-12.6.4.1 | 256.2 MB | #######7 | 77%  2025-05-07T20:26:00.3024539Z nsight-compute-2024. | 443.1 MB | ####5 | 46% 2025-05-07T20:26:00.3024980Z 2025-05-07T20:26:00.3024986Z 2025-05-07T20:26:00.3024992Z 2025-05-07T20:26:00.3024998Z 2025-05-07T20:26:00.3025003Z 2025-05-07T20:26:00.3047365Z cuda-nvvp-12.6.80 | 109.3 MB | 6 | 7%  2025-05-07T20:26:00.3047798Z 2025-05-07T20:26:00.3047804Z 2025-05-07T20:26:00.3050802Z 2025-05-07T20:26:00.3459197Z libcusparse-12.5.4.2 | 118.6 MB | ########## | 100%  2025-05-07T20:26:00.3653821Z nsight-compute-2024. | 443.1 MB | ####6 | 47% 2025-05-07T20:26:00.3659184Z 2025-05-07T20:26:00.3666225Z libcublas-12.6.4.1 | 256.2 MB | #######8 | 79%  2025-05-07T20:26:00.3666488Z 2025-05-07T20:26:00.3666492Z 2025-05-07T20:26:00.3666495Z 2025-05-07T20:26:00.3666499Z 2025-05-07T20:26:00.3666503Z 2025-05-07T20:26:00.3666748Z 2025-05-07T20:26:00.4026207Z libcusolver-11.7.1.2 | 95.8 MB | | 0%  2025-05-07T20:26:00.4026506Z 2025-05-07T20:26:00.4026510Z 2025-05-07T20:26:00.4026514Z 2025-05-07T20:26:00.4026518Z 2025-05-07T20:26:00.4026522Z 2025-05-07T20:26:00.4658917Z cuda-nvvp-12.6.80 | 109.3 MB | 9 | 10%  2025-05-07T20:26:00.4659208Z 2025-05-07T20:26:00.4659213Z 2025-05-07T20:26:00.4659216Z 2025-05-07T20:26:00.4659255Z 2025-05-07T20:26:00.4659259Z 2025-05-07T20:26:00.4661009Z 2025-05-07T20:26:00.4760897Z libcusolver-11.7.1.2 | 95.8 MB | 3 | 3%  2025-05-07T20:26:00.5035718Z nsight-compute-2024. 
| 443.1 MB | ####7 | 48% 2025-05-07T20:26:00.5035977Z 2025-05-07T20:26:00.5035981Z 2025-05-07T20:26:00.5035985Z 2025-05-07T20:26:00.5035989Z 2025-05-07T20:26:00.5035993Z 2025-05-07T20:26:00.5088969Z cuda-nvvp-12.6.80 | 109.3 MB | #2 | 12%  2025-05-07T20:26:00.5094436Z 2025-05-07T20:26:00.5659308Z libcublas-12.6.4.1 | 256.2 MB | ######## | 81%  2025-05-07T20:26:00.5659590Z 2025-05-07T20:26:00.5659594Z 2025-05-07T20:26:00.5659597Z 2025-05-07T20:26:00.5659602Z 2025-05-07T20:26:00.5659605Z 2025-05-07T20:26:00.5660337Z 2025-05-07T20:26:00.5885444Z libcusolver-11.7.1.2 | 95.8 MB | 6 | 6%  2025-05-07T20:26:00.6040262Z nsight-compute-2024. | 443.1 MB | ####8 | 48% 2025-05-07T20:26:00.6040528Z 2025-05-07T20:26:00.6040569Z 2025-05-07T20:26:00.6040574Z 2025-05-07T20:26:00.6040577Z 2025-05-07T20:26:00.6044702Z 2025-05-07T20:26:00.6337269Z cuda-nvvp-12.6.80 | 109.3 MB | #5 | 15%  2025-05-07T20:26:00.6337552Z 2025-05-07T20:26:00.6662000Z libcublas-12.6.4.1 | 256.2 MB | ########2 | 82%  2025-05-07T20:26:00.6662366Z 2025-05-07T20:26:00.6662372Z 2025-05-07T20:26:00.6662377Z 2025-05-07T20:26:00.6662382Z 2025-05-07T20:26:00.6662387Z 2025-05-07T20:26:00.6664749Z 2025-05-07T20:26:00.7042124Z libcusolver-11.7.1.2 | 95.8 MB | 9 | 9%  2025-05-07T20:26:00.7083688Z nsight-compute-2024. | 443.1 MB | ####9 | 49% 2025-05-07T20:26:00.7083953Z 2025-05-07T20:26:00.7083957Z 2025-05-07T20:26:00.7083960Z 2025-05-07T20:26:00.7083964Z 2025-05-07T20:26:00.7086845Z 2025-05-07T20:26:00.7482071Z cuda-nvvp-12.6.80 | 109.3 MB | #7 | 18%  2025-05-07T20:26:00.7483126Z 2025-05-07T20:26:00.7665196Z libcublas-12.6.4.1 | 256.2 MB | ########3 | 84%  2025-05-07T20:26:00.7665642Z 2025-05-07T20:26:00.7665646Z 2025-05-07T20:26:00.7665650Z 2025-05-07T20:26:00.7665653Z 2025-05-07T20:26:00.7665657Z 2025-05-07T20:26:00.7667758Z 2025-05-07T20:26:00.8145701Z libcusolver-11.7.1.2 | 95.8 MB | #2 | 12%  2025-05-07T20:26:00.8274926Z nsight-compute-2024. | 443.1 MB | ####9 | 50% 2025-05-07T20:26:00.8275185Z 2025-05-07T20:26:00.8275197Z 2025-05-07T20:26:00.8275201Z 2025-05-07T20:26:00.8275205Z 2025-05-07T20:26:00.8279070Z 2025-05-07T20:26:00.8666536Z cuda-nvvp-12.6.80 | 109.3 MB | ## | 21%  2025-05-07T20:26:00.8666841Z 2025-05-07T20:26:00.8666845Z 2025-05-07T20:26:00.8666849Z 2025-05-07T20:26:00.8666853Z 2025-05-07T20:26:00.8666856Z 2025-05-07T20:26:00.8666860Z 2025-05-07T20:26:00.8673409Z libcusolver-11.7.1.2 | 95.8 MB | #5 | 15%  2025-05-07T20:26:00.8673864Z 2025-05-07T20:26:00.9317840Z libcublas-12.6.4.1 | 256.2 MB | ########4 | 85%  2025-05-07T20:26:00.9371619Z nsight-compute-2024. | 443.1 MB | ##### | 51% 2025-05-07T20:26:00.9371892Z 2025-05-07T20:26:00.9371896Z 2025-05-07T20:26:00.9371900Z 2025-05-07T20:26:00.9371903Z 2025-05-07T20:26:00.9371939Z 2025-05-07T20:26:00.9673384Z cuda-nvvp-12.6.80 | 109.3 MB | ##3 | 23%  2025-05-07T20:26:00.9673679Z 2025-05-07T20:26:00.9673683Z 2025-05-07T20:26:00.9673686Z 2025-05-07T20:26:00.9673690Z 2025-05-07T20:26:00.9673694Z 2025-05-07T20:26:00.9674392Z 2025-05-07T20:26:00.9790537Z libcusolver-11.7.1.2 | 95.8 MB | #8 | 18%  2025-05-07T20:26:00.9790861Z 2025-05-07T20:26:01.0322233Z libcublas-12.6.4.1 | 256.2 MB | ########6 | 86%  2025-05-07T20:26:01.0397183Z nsight-compute-2024. 
| 443.1 MB | #####1 | 51% 2025-05-07T20:26:01.0397445Z 2025-05-07T20:26:01.0397501Z 2025-05-07T20:26:01.0397505Z 2025-05-07T20:26:01.0397658Z 2025-05-07T20:26:01.0398584Z 2025-05-07T20:26:01.0684261Z cuda-nvvp-12.6.80 | 109.3 MB | ##5 | 26%  2025-05-07T20:26:01.0684670Z 2025-05-07T20:26:01.0684677Z 2025-05-07T20:26:01.0684684Z 2025-05-07T20:26:01.0684690Z 2025-05-07T20:26:01.0684697Z 2025-05-07T20:26:01.0684812Z 2025-05-07T20:26:01.0958190Z libcusolver-11.7.1.2 | 95.8 MB | ##1 | 21%  2025-05-07T20:26:01.0958636Z 2025-05-07T20:26:01.1327032Z libcublas-12.6.4.1 | 256.2 MB | ########7 | 88%  2025-05-07T20:26:01.1398868Z nsight-compute-2024. | 443.1 MB | #####1 | 52% 2025-05-07T20:26:01.1399122Z 2025-05-07T20:26:01.1399250Z 2025-05-07T20:26:01.1399254Z 2025-05-07T20:26:01.1399271Z 2025-05-07T20:26:01.1399377Z 2025-05-07T20:26:01.1686675Z cuda-nvvp-12.6.80 | 109.3 MB | ##8 | 29%  2025-05-07T20:26:01.1687109Z 2025-05-07T20:26:01.1687115Z 2025-05-07T20:26:01.1687121Z 2025-05-07T20:26:01.1687126Z 2025-05-07T20:26:01.1687131Z 2025-05-07T20:26:01.1689657Z 2025-05-07T20:26:01.2013138Z libcusolver-11.7.1.2 | 95.8 MB | ##4 | 25%  2025-05-07T20:26:01.2013606Z 2025-05-07T20:26:01.2399875Z libcublas-12.6.4.1 | 256.2 MB | ########8 | 89%  2025-05-07T20:26:01.2409480Z nsight-compute-2024. | 443.1 MB | #####2 | 53% 2025-05-07T20:26:01.2409732Z 2025-05-07T20:26:01.2409771Z 2025-05-07T20:26:01.2409775Z 2025-05-07T20:26:01.2409778Z 2025-05-07T20:26:01.2410219Z 2025-05-07T20:26:01.2689688Z cuda-nvvp-12.6.80 | 109.3 MB | ###1 | 32%  2025-05-07T20:26:01.2689980Z 2025-05-07T20:26:01.2690126Z 2025-05-07T20:26:01.2690137Z 2025-05-07T20:26:01.2690140Z 2025-05-07T20:26:01.2690144Z 2025-05-07T20:26:01.2692927Z 2025-05-07T20:26:01.3048617Z libcusolver-11.7.1.2 | 95.8 MB | ##7 | 28%  2025-05-07T20:26:01.3050923Z 2025-05-07T20:26:01.3404032Z libcublas-12.6.4.1 | 256.2 MB | ######### | 90%  2025-05-07T20:26:01.3420691Z nsight-compute-2024. | 443.1 MB | #####3 | 53% 2025-05-07T20:26:01.3421027Z 2025-05-07T20:26:01.3421383Z 2025-05-07T20:26:01.3421570Z 2025-05-07T20:26:01.3421577Z 2025-05-07T20:26:01.3425504Z 2025-05-07T20:26:01.3692515Z cuda-nvvp-12.6.80 | 109.3 MB | ###4 | 34%  2025-05-07T20:26:01.3692797Z 2025-05-07T20:26:01.3693239Z 2025-05-07T20:26:01.3693245Z 2025-05-07T20:26:01.3693249Z 2025-05-07T20:26:01.3693252Z 2025-05-07T20:26:01.3694148Z 2025-05-07T20:26:01.4052603Z libcusolver-11.7.1.2 | 95.8 MB | ###1 | 31%  2025-05-07T20:26:01.4054091Z 2025-05-07T20:26:01.4438346Z libcublas-12.6.4.1 | 256.2 MB | #########1 | 91%  2025-05-07T20:26:01.4438605Z 2025-05-07T20:26:01.4438609Z 2025-05-07T20:26:01.4438613Z 2025-05-07T20:26:01.4438617Z 2025-05-07T20:26:01.4441655Z 2025-05-07T20:26:01.4473921Z cuda-nvvp-12.6.80 | 109.3 MB | ###7 | 37%  2025-05-07T20:26:01.4693236Z nsight-compute-2024. | 443.1 MB | #####4 | 54% 2025-05-07T20:26:01.4693493Z 2025-05-07T20:26:01.4693497Z 2025-05-07T20:26:01.4693501Z 2025-05-07T20:26:01.4693542Z 2025-05-07T20:26:01.4693546Z 2025-05-07T20:26:01.4696616Z 2025-05-07T20:26:01.5053588Z libcusolver-11.7.1.2 | 95.8 MB | ###4 | 35%  2025-05-07T20:26:01.5054466Z 2025-05-07T20:26:01.5512038Z libcublas-12.6.4.1 | 256.2 MB | #########2 | 93%  2025-05-07T20:26:01.5512457Z nsight-compute-2024. 
| 443.1 MB | #####4 | 55% 2025-05-07T20:26:01.5512696Z 2025-05-07T20:26:01.5512700Z 2025-05-07T20:26:01.5512704Z 2025-05-07T20:26:01.5512708Z 2025-05-07T20:26:01.5512971Z 2025-05-07T20:26:01.6058446Z cuda-nvvp-12.6.80 | 109.3 MB | ###9 | 40%  2025-05-07T20:26:01.6059033Z 2025-05-07T20:26:01.6235173Z libcublas-12.6.4.1 | 256.2 MB | #########4 | 94%  2025-05-07T20:26:01.6235432Z 2025-05-07T20:26:01.6235436Z 2025-05-07T20:26:01.6235440Z 2025-05-07T20:26:01.6235444Z 2025-05-07T20:26:01.6235447Z 2025-05-07T20:26:01.6241483Z 2025-05-07T20:26:01.6513334Z libcusolver-11.7.1.2 | 95.8 MB | ###7 | 38%  2025-05-07T20:26:01.6513693Z 2025-05-07T20:26:01.6513698Z 2025-05-07T20:26:01.6513702Z 2025-05-07T20:26:01.6513705Z 2025-05-07T20:26:01.6513709Z 2025-05-07T20:26:01.7060380Z cuda-nvvp-12.6.80 | 109.3 MB | ####2 | 43%  2025-05-07T20:26:01.7062946Z 2025-05-07T20:26:01.7238224Z libcublas-12.6.4.1 | 256.2 MB | #########5 | 95%  2025-05-07T20:26:01.7238490Z 2025-05-07T20:26:01.7238494Z 2025-05-07T20:26:01.7238497Z 2025-05-07T20:26:01.7238501Z 2025-05-07T20:26:01.7238505Z 2025-05-07T20:26:01.7238992Z 2025-05-07T20:26:01.7516710Z libcusolver-11.7.1.2 | 95.8 MB | ####1 | 42%  2025-05-07T20:26:01.7517025Z 2025-05-07T20:26:01.7517029Z 2025-05-07T20:26:01.7517033Z 2025-05-07T20:26:01.7517037Z 2025-05-07T20:26:01.7517043Z 2025-05-07T20:26:01.7599332Z cuda-nvvp-12.6.80 | 109.3 MB | ####5 | 46%  2025-05-07T20:26:01.8324182Z nsight-compute-2024. | 443.1 MB | #####5 | 55% 2025-05-07T20:26:01.8327183Z 2025-05-07T20:26:01.8414621Z libcublas-12.6.4.1 | 256.2 MB | #########6 | 97%  2025-05-07T20:26:01.8415001Z 2025-05-07T20:26:01.8415007Z 2025-05-07T20:26:01.8415010Z 2025-05-07T20:26:01.8415023Z 2025-05-07T20:26:01.8415027Z 2025-05-07T20:26:01.8415031Z 2025-05-07T20:26:01.8519021Z libcusolver-11.7.1.2 | 95.8 MB | ####5 | 45%  2025-05-07T20:26:01.8519325Z 2025-05-07T20:26:01.8519329Z 2025-05-07T20:26:01.8519340Z 2025-05-07T20:26:01.8519343Z 2025-05-07T20:26:01.8519347Z 2025-05-07T20:26:01.8608975Z cuda-nvvp-12.6.80 | 109.3 MB | ####8 | 49%  2025-05-07T20:26:01.9325440Z nsight-compute-2024. | 443.1 MB | #####6 | 56% 2025-05-07T20:26:01.9328533Z 2025-05-07T20:26:01.9419283Z libcublas-12.6.4.1 | 256.2 MB | #########7 | 98%  2025-05-07T20:26:01.9429300Z 2025-05-07T20:26:01.9429305Z 2025-05-07T20:26:01.9429309Z 2025-05-07T20:26:01.9429313Z 2025-05-07T20:26:01.9429317Z 2025-05-07T20:26:01.9429396Z 2025-05-07T20:26:01.9604345Z libcusolver-11.7.1.2 | 95.8 MB | ####8 | 48%  2025-05-07T20:26:01.9604823Z 2025-05-07T20:26:01.9604827Z 2025-05-07T20:26:01.9604831Z 2025-05-07T20:26:01.9604834Z 2025-05-07T20:26:01.9604838Z 2025-05-07T20:26:01.9611839Z cuda-nvvp-12.6.80 | 109.3 MB | #####1 | 52%  2025-05-07T20:26:02.0417992Z nsight-compute-2024. | 443.1 MB | #####6 | 57% 2025-05-07T20:26:02.0418722Z 2025-05-07T20:26:02.0429003Z libcublas-12.6.4.1 | 256.2 MB | #########9 | 99%  2025-05-07T20:26:02.0429261Z 2025-05-07T20:26:02.0429265Z 2025-05-07T20:26:02.0429270Z 2025-05-07T20:26:02.0429287Z 2025-05-07T20:26:02.0429292Z 2025-05-07T20:26:02.0429324Z 2025-05-07T20:26:02.0613597Z libcusolver-11.7.1.2 | 95.8 MB | #####1 | 51%  2025-05-07T20:26:02.0671033Z nsight-compute-2024. 
| 443.1 MB | #####7 | 57% 2025-05-07T20:26:02.0671478Z 2025-05-07T20:26:02.0671482Z 2025-05-07T20:26:02.0671486Z 2025-05-07T20:26:02.0671489Z 2025-05-07T20:26:02.0672499Z 2025-05-07T20:26:02.1431019Z cuda-nvvp-12.6.80 | 109.3 MB | #####4 | 55%  2025-05-07T20:26:02.1431317Z 2025-05-07T20:26:02.1431321Z 2025-05-07T20:26:02.1431325Z 2025-05-07T20:26:02.1431329Z 2025-05-07T20:26:02.1431332Z 2025-05-07T20:26:02.1433766Z 2025-05-07T20:26:02.1652203Z libcusolver-11.7.1.2 | 95.8 MB | #####4 | 54%  2025-05-07T20:26:02.1674811Z nsight-compute-2024. | 443.1 MB | #####8 | 58% 2025-05-07T20:26:02.1675060Z 2025-05-07T20:26:02.1675064Z 2025-05-07T20:26:02.1675068Z 2025-05-07T20:26:02.1675071Z 2025-05-07T20:26:02.1677906Z 2025-05-07T20:26:02.2432485Z cuda-nvvp-12.6.80 | 109.3 MB | #####7 | 58%  2025-05-07T20:26:02.2432941Z 2025-05-07T20:26:02.2432948Z 2025-05-07T20:26:02.2432954Z 2025-05-07T20:26:02.2432959Z 2025-05-07T20:26:02.2432964Z 2025-05-07T20:26:02.2432970Z 2025-05-07T20:26:02.2653622Z libcusolver-11.7.1.2 | 95.8 MB | #####7 | 58%  2025-05-07T20:26:02.2676568Z nsight-compute-2024. | 443.1 MB | #####8 | 59% 2025-05-07T20:26:02.2676855Z 2025-05-07T20:26:02.2676859Z 2025-05-07T20:26:02.2676863Z 2025-05-07T20:26:02.2676867Z 2025-05-07T20:26:02.2681138Z 2025-05-07T20:26:02.3118255Z cuda-nvvp-12.6.80 | 109.3 MB | ###### | 61%  2025-05-07T20:26:02.3118550Z 2025-05-07T20:26:02.3123613Z 2025-05-07T20:26:02.3439322Z libcufft-11.3.0.4 | 156.2 MB | ########## | 100%  2025-05-07T20:26:02.3439609Z 2025-05-07T20:26:02.3439620Z 2025-05-07T20:26:02.3439625Z 2025-05-07T20:26:02.3439630Z 2025-05-07T20:26:02.3439634Z 2025-05-07T20:26:02.3442108Z 2025-05-07T20:26:02.3664088Z libcusolver-11.7.1.2 | 95.8 MB | ######1 | 61%  2025-05-07T20:26:02.3829610Z nsight-compute-2024. | 443.1 MB | #####9 | 59% 2025-05-07T20:26:02.3829939Z 2025-05-07T20:26:02.3829943Z 2025-05-07T20:26:02.3829947Z 2025-05-07T20:26:02.3829950Z 2025-05-07T20:26:02.3829954Z 2025-05-07T20:26:02.3906409Z cuda-nvvp-12.6.80 | 109.3 MB | ######3 | 63%  2025-05-07T20:26:02.3906849Z 2025-05-07T20:26:02.3906855Z 2025-05-07T20:26:02.3906860Z 2025-05-07T20:26:02.3906865Z 2025-05-07T20:26:02.3906871Z 2025-05-07T20:26:02.3906874Z 2025-05-07T20:26:02.3906878Z 2025-05-07T20:26:02.4518876Z libnpp-12.3.1.54 | 93.4 MB | | 0%  2025-05-07T20:26:02.4519187Z 2025-05-07T20:26:02.4519191Z 2025-05-07T20:26:02.4519194Z 2025-05-07T20:26:02.4519198Z 2025-05-07T20:26:02.4519202Z 2025-05-07T20:26:02.4519205Z 2025-05-07T20:26:02.4864358Z libcusolver-11.7.1.2 | 95.8 MB | ######4 | 64%  2025-05-07T20:26:02.4907644Z nsight-compute-2024. 
| 443.1 MB | ###### | 60% 2025-05-07T20:26:02.4908497Z 2025-05-07T20:26:02.4908805Z 2025-05-07T20:26:02.4908835Z 2025-05-07T20:26:02.4908838Z 2025-05-07T20:26:02.4908842Z 2025-05-07T20:26:02.4908961Z 2025-05-07T20:26:02.4908965Z 2025-05-07T20:26:02.5028370Z libnpp-12.3.1.54 | 93.4 MB | 3 | 3%  2025-05-07T20:26:02.5029364Z 2025-05-07T20:26:02.5029372Z 2025-05-07T20:26:02.5029378Z 2025-05-07T20:26:02.5029383Z 2025-05-07T20:26:02.5029389Z 2025-05-07T20:26:02.5651261Z cuda-nvvp-12.6.80 | 109.3 MB | ######6 | 66%  2025-05-07T20:26:02.5651571Z 2025-05-07T20:26:02.5651578Z 2025-05-07T20:26:02.5651583Z 2025-05-07T20:26:02.5651589Z 2025-05-07T20:26:02.5651592Z 2025-05-07T20:26:02.5651596Z 2025-05-07T20:26:02.5909439Z libcusolver-11.7.1.2 | 95.8 MB | ######7 | 67%  2025-05-07T20:26:02.5909861Z 2025-05-07T20:26:02.5909865Z 2025-05-07T20:26:02.5909869Z 2025-05-07T20:26:02.5909872Z 2025-05-07T20:26:02.5909877Z 2025-05-07T20:26:02.5909881Z 2025-05-07T20:26:02.5909886Z 2025-05-07T20:26:02.5927424Z libnpp-12.3.1.54 | 93.4 MB | 5 | 6%  2025-05-07T20:26:02.6139289Z nsight-compute-2024. | 443.1 MB | ###### | 61% 2025-05-07T20:26:02.6139648Z 2025-05-07T20:26:02.6139654Z 2025-05-07T20:26:02.6139659Z 2025-05-07T20:26:02.6139694Z 2025-05-07T20:26:02.6141725Z 2025-05-07T20:26:02.6827131Z cuda-nvvp-12.6.80 | 109.3 MB | ######8 | 69%  2025-05-07T20:26:02.6827512Z 2025-05-07T20:26:02.6827517Z 2025-05-07T20:26:02.6827523Z 2025-05-07T20:26:02.6827528Z 2025-05-07T20:26:02.6827533Z 2025-05-07T20:26:02.6827539Z 2025-05-07T20:26:02.6913251Z libcusolver-11.7.1.2 | 95.8 MB | ####### | 70%  2025-05-07T20:26:02.6913620Z 2025-05-07T20:26:02.6913624Z 2025-05-07T20:26:02.6913812Z 2025-05-07T20:26:02.6913819Z 2025-05-07T20:26:02.6913825Z 2025-05-07T20:26:02.6913830Z 2025-05-07T20:26:02.6920876Z 2025-05-07T20:26:02.7039240Z libnpp-12.3.1.54 | 93.4 MB | 8 | 9%  2025-05-07T20:26:02.7202012Z nsight-compute-2024. | 443.1 MB | ######1 | 61% 2025-05-07T20:26:02.7202368Z 2025-05-07T20:26:02.7202372Z 2025-05-07T20:26:02.7202376Z 2025-05-07T20:26:02.7202380Z 2025-05-07T20:26:02.7204295Z 2025-05-07T20:26:02.7836171Z cuda-nvvp-12.6.80 | 109.3 MB | #######1 | 71%  2025-05-07T20:26:02.7836483Z 2025-05-07T20:26:02.7836487Z 2025-05-07T20:26:02.7836490Z 2025-05-07T20:26:02.7836494Z 2025-05-07T20:26:02.7836506Z 2025-05-07T20:26:02.7842621Z 2025-05-07T20:26:02.7919540Z libcusolver-11.7.1.2 | 95.8 MB | #######3 | 73%  2025-05-07T20:26:02.7919950Z 2025-05-07T20:26:02.7919953Z 2025-05-07T20:26:02.7919966Z 2025-05-07T20:26:02.7919970Z 2025-05-07T20:26:02.7919973Z 2025-05-07T20:26:02.7919977Z 2025-05-07T20:26:02.7919981Z 2025-05-07T20:26:02.8174471Z libnpp-12.3.1.54 | 93.4 MB | #1 | 12%  2025-05-07T20:26:02.8302144Z nsight-compute-2024. | 443.1 MB | ######1 | 62% 2025-05-07T20:26:02.8302451Z 2025-05-07T20:26:02.8302457Z 2025-05-07T20:26:02.8302462Z 2025-05-07T20:26:02.8302467Z 2025-05-07T20:26:02.8306032Z 2025-05-07T20:26:02.8923766Z cuda-nvvp-12.6.80 | 109.3 MB | #######4 | 74%  2025-05-07T20:26:02.8924111Z 2025-05-07T20:26:02.8924144Z 2025-05-07T20:26:02.8924164Z 2025-05-07T20:26:02.8924168Z 2025-05-07T20:26:02.8924171Z 2025-05-07T20:26:02.8924175Z 2025-05-07T20:26:02.8931250Z 2025-05-07T20:26:02.8953692Z libnpp-12.3.1.54 | 93.4 MB | #4 | 14%  2025-05-07T20:26:02.8953993Z 2025-05-07T20:26:02.8953996Z 2025-05-07T20:26:02.8954000Z 2025-05-07T20:26:02.8954004Z 2025-05-07T20:26:02.8954007Z 2025-05-07T20:26:02.8954011Z 2025-05-07T20:26:02.9186565Z libcusolver-11.7.1.2 | 95.8 MB | #######6 | 76%  2025-05-07T20:26:02.9476915Z nsight-compute-2024. 
| 443.1 MB | ######2 | 62% 2025-05-07T20:26:02.9477186Z 2025-05-07T20:26:02.9477190Z 2025-05-07T20:26:02.9477194Z 2025-05-07T20:26:02.9479210Z 2025-05-07T20:26:02.9479216Z 2025-05-07T20:26:02.9938586Z cuda-nvvp-12.6.80 | 109.3 MB | #######6 | 76%  2025-05-07T20:26:02.9939017Z 2025-05-07T20:26:02.9939023Z 2025-05-07T20:26:02.9939028Z 2025-05-07T20:26:02.9939033Z 2025-05-07T20:26:02.9939038Z 2025-05-07T20:26:02.9939513Z 2025-05-07T20:26:02.9939519Z 2025-05-07T20:26:02.9958314Z libnpp-12.3.1.54 | 93.4 MB | #7 | 17%  2025-05-07T20:26:02.9958598Z 2025-05-07T20:26:02.9958602Z 2025-05-07T20:26:02.9958606Z 2025-05-07T20:26:02.9958610Z 2025-05-07T20:26:02.9958613Z 2025-05-07T20:26:02.9958617Z 2025-05-07T20:26:03.0190558Z libcusolver-11.7.1.2 | 95.8 MB | #######9 | 79%  2025-05-07T20:26:03.0593725Z nsight-compute-2024. | 443.1 MB | ######3 | 63% 2025-05-07T20:26:03.0593987Z 2025-05-07T20:26:03.0593991Z 2025-05-07T20:26:03.0594005Z 2025-05-07T20:26:03.0594009Z 2025-05-07T20:26:03.0599548Z 2025-05-07T20:26:03.0944468Z cuda-nvvp-12.6.80 | 109.3 MB | #######8 | 79%  2025-05-07T20:26:03.0944824Z 2025-05-07T20:26:03.0944831Z 2025-05-07T20:26:03.0944836Z 2025-05-07T20:26:03.0944841Z 2025-05-07T20:26:03.0944846Z 2025-05-07T20:26:03.0944851Z 2025-05-07T20:26:03.0946739Z 2025-05-07T20:26:03.0999440Z libnpp-12.3.1.54 | 93.4 MB | ## | 20%  2025-05-07T20:26:03.0999764Z 2025-05-07T20:26:03.0999769Z 2025-05-07T20:26:03.0999773Z 2025-05-07T20:26:03.0999779Z 2025-05-07T20:26:03.0999788Z 2025-05-07T20:26:03.0999810Z 2025-05-07T20:26:03.1192905Z libcusolver-11.7.1.2 | 95.8 MB | ########2 | 82%  2025-05-07T20:26:03.1637778Z nsight-compute-2024. | 443.1 MB | ######3 | 64% 2025-05-07T20:26:03.1638055Z 2025-05-07T20:26:03.1638059Z 2025-05-07T20:26:03.1638063Z 2025-05-07T20:26:03.1638067Z 2025-05-07T20:26:03.1641593Z 2025-05-07T20:26:03.1960053Z cuda-nvvp-12.6.80 | 109.3 MB | ########1 | 81%  2025-05-07T20:26:03.1960344Z 2025-05-07T20:26:03.1960349Z 2025-05-07T20:26:03.1960352Z 2025-05-07T20:26:03.1960356Z 2025-05-07T20:26:03.1960360Z 2025-05-07T20:26:03.1960364Z 2025-05-07T20:26:03.1962649Z 2025-05-07T20:26:03.2122308Z libnpp-12.3.1.54 | 93.4 MB | ##2 | 23%  2025-05-07T20:26:03.2122594Z 2025-05-07T20:26:03.2122631Z 2025-05-07T20:26:03.2122635Z 2025-05-07T20:26:03.2122639Z 2025-05-07T20:26:03.2122642Z 2025-05-07T20:26:03.2122646Z 2025-05-07T20:26:03.2204525Z libcusolver-11.7.1.2 | 95.8 MB | ########5 | 85%  2025-05-07T20:26:03.2637920Z nsight-compute-2024. | 443.1 MB | ######4 | 64% 2025-05-07T20:26:03.2638189Z 2025-05-07T20:26:03.2638193Z 2025-05-07T20:26:03.2638197Z 2025-05-07T20:26:03.2638201Z 2025-05-07T20:26:03.2640442Z 2025-05-07T20:26:03.3007072Z cuda-nvvp-12.6.80 | 109.3 MB | ########3 | 83%  2025-05-07T20:26:03.3007366Z 2025-05-07T20:26:03.3007370Z 2025-05-07T20:26:03.3007374Z 2025-05-07T20:26:03.3007377Z 2025-05-07T20:26:03.3007381Z 2025-05-07T20:26:03.3007385Z 2025-05-07T20:26:03.3008135Z 2025-05-07T20:26:03.3168618Z libnpp-12.3.1.54 | 93.4 MB | ##5 | 26%  2025-05-07T20:26:03.3168896Z 2025-05-07T20:26:03.3168900Z 2025-05-07T20:26:03.3168904Z 2025-05-07T20:26:03.3175054Z 2025-05-07T20:26:03.3183129Z cuda-nsight-12.6.77 | 113.2 MB | ########## | 100%  2025-05-07T20:26:03.3183402Z 2025-05-07T20:26:03.3183407Z 2025-05-07T20:26:03.3183410Z 2025-05-07T20:26:03.3183414Z 2025-05-07T20:26:03.3183418Z 2025-05-07T20:26:03.3183421Z 2025-05-07T20:26:03.3217825Z libcusolver-11.7.1.2 | 95.8 MB | ########7 | 88%  2025-05-07T20:26:03.3809416Z nsight-compute-2024. 
| 443.1 MB | ######4 | 65% 2025-05-07T20:26:03.3809706Z 2025-05-07T20:26:03.3809710Z 2025-05-07T20:26:03.3809714Z 2025-05-07T20:26:03.3809720Z 2025-05-07T20:26:03.3814085Z 2025-05-07T20:26:03.4009664Z cuda-nvvp-12.6.80 | 109.3 MB | ########5 | 86%  2025-05-07T20:26:03.4009961Z 2025-05-07T20:26:03.4009965Z 2025-05-07T20:26:03.4009969Z 2025-05-07T20:26:03.4009972Z 2025-05-07T20:26:03.4009976Z 2025-05-07T20:26:03.4009979Z 2025-05-07T20:26:03.4011008Z 2025-05-07T20:26:03.4185521Z libnpp-12.3.1.54 | 93.4 MB | ##8 | 29%  2025-05-07T20:26:03.4186265Z 2025-05-07T20:26:03.4186496Z 2025-05-07T20:26:03.4186505Z 2025-05-07T20:26:03.4186512Z 2025-05-07T20:26:03.4186519Z 2025-05-07T20:26:03.4186574Z 2025-05-07T20:26:03.4224824Z libcusolver-11.7.1.2 | 95.8 MB | ######### | 91%  2025-05-07T20:26:03.4812413Z nsight-compute-2024. | 443.1 MB | ######5 | 65% 2025-05-07T20:26:03.4812687Z 2025-05-07T20:26:03.4812698Z 2025-05-07T20:26:03.4812702Z 2025-05-07T20:26:03.4812706Z 2025-05-07T20:26:03.4812710Z 2025-05-07T20:26:03.5011029Z cuda-nvvp-12.6.80 | 109.3 MB | ########8 | 88%  2025-05-07T20:26:03.5011314Z 2025-05-07T20:26:03.5011324Z 2025-05-07T20:26:03.5011327Z 2025-05-07T20:26:03.5011331Z 2025-05-07T20:26:03.5011334Z 2025-05-07T20:26:03.5011339Z 2025-05-07T20:26:03.5011342Z 2025-05-07T20:26:03.5187284Z libnpp-12.3.1.54 | 93.4 MB | ###1 | 31%  2025-05-07T20:26:03.5187575Z 2025-05-07T20:26:03.5187579Z 2025-05-07T20:26:03.5187583Z 2025-05-07T20:26:03.5187611Z 2025-05-07T20:26:03.5187625Z 2025-05-07T20:26:03.5187629Z 2025-05-07T20:26:03.5229226Z libcusolver-11.7.1.2 | 95.8 MB | #########3 | 94%  2025-05-07T20:26:03.5816007Z nsight-compute-2024. | 443.1 MB | ######6 | 66% 2025-05-07T20:26:03.5816377Z 2025-05-07T20:26:03.5816382Z 2025-05-07T20:26:03.5816387Z 2025-05-07T20:26:03.5816393Z 2025-05-07T20:26:03.5816409Z 2025-05-07T20:26:03.6011598Z cuda-nvvp-12.6.80 | 109.3 MB | ######### | 91%  2025-05-07T20:26:03.6011979Z 2025-05-07T20:26:03.6011984Z 2025-05-07T20:26:03.6011989Z 2025-05-07T20:26:03.6012005Z 2025-05-07T20:26:03.6012012Z 2025-05-07T20:26:03.6012017Z 2025-05-07T20:26:03.6012022Z 2025-05-07T20:26:03.6189862Z libnpp-12.3.1.54 | 93.4 MB | ###4 | 34%  2025-05-07T20:26:03.6190238Z 2025-05-07T20:26:03.6190773Z 2025-05-07T20:26:03.6190779Z 2025-05-07T20:26:03.6190784Z 2025-05-07T20:26:03.6190789Z 2025-05-07T20:26:03.6190962Z 2025-05-07T20:26:03.6230302Z libcusolver-11.7.1.2 | 95.8 MB | #########7 | 97%  2025-05-07T20:26:03.6820078Z nsight-compute-2024. | 443.1 MB | ######6 | 67% 2025-05-07T20:26:03.6820361Z 2025-05-07T20:26:03.6820366Z 2025-05-07T20:26:03.6820372Z 2025-05-07T20:26:03.6820377Z 2025-05-07T20:26:03.6824588Z 2025-05-07T20:26:03.7018095Z cuda-nvvp-12.6.80 | 109.3 MB | #########3 | 94%  2025-05-07T20:26:03.7018389Z 2025-05-07T20:26:03.7018395Z 2025-05-07T20:26:03.7018400Z 2025-05-07T20:26:03.7018413Z 2025-05-07T20:26:03.7018417Z 2025-05-07T20:26:03.7018423Z 2025-05-07T20:26:03.7018428Z 2025-05-07T20:26:03.7232588Z libnpp-12.3.1.54 | 93.4 MB | ###7 | 38%  2025-05-07T20:26:03.7823151Z nsight-compute-2024. 
| 443.1 MB | ######7 | 67% 2025-05-07T20:26:03.7823517Z 2025-05-07T20:26:03.7823524Z 2025-05-07T20:26:03.7823530Z 2025-05-07T20:26:03.7823536Z 2025-05-07T20:26:03.7826628Z 2025-05-07T20:26:03.8019937Z cuda-nvvp-12.6.80 | 109.3 MB | #########6 | 97%  2025-05-07T20:26:03.8020306Z 2025-05-07T20:26:03.8020312Z 2025-05-07T20:26:03.8020317Z 2025-05-07T20:26:03.8020322Z 2025-05-07T20:26:03.8020327Z 2025-05-07T20:26:03.8020333Z 2025-05-07T20:26:03.8020338Z 2025-05-07T20:26:03.8237343Z libnpp-12.3.1.54 | 93.4 MB | ####1 | 41%  2025-05-07T20:26:03.8824533Z nsight-compute-2024. | 443.1 MB | ######8 | 68% 2025-05-07T20:26:03.8824913Z 2025-05-07T20:26:03.8824917Z 2025-05-07T20:26:03.8824920Z 2025-05-07T20:26:03.8824924Z 2025-05-07T20:26:03.8824928Z 2025-05-07T20:26:03.9023647Z cuda-nvvp-12.6.80 | 109.3 MB | #########9 | 100%  2025-05-07T20:26:03.9024037Z 2025-05-07T20:26:03.9024043Z 2025-05-07T20:26:03.9024048Z 2025-05-07T20:26:03.9024053Z 2025-05-07T20:26:03.9024058Z 2025-05-07T20:26:03.9024063Z 2025-05-07T20:26:03.9025410Z 2025-05-07T20:26:03.9242385Z libnpp-12.3.1.54 | 93.4 MB | ####4 | 45%  2025-05-07T20:26:04.0024376Z nsight-compute-2024. | 443.1 MB | ######8 | 69% 2025-05-07T20:26:04.0024828Z 2025-05-07T20:26:04.0024832Z 2025-05-07T20:26:04.0024836Z 2025-05-07T20:26:04.0024840Z 2025-05-07T20:26:04.0024843Z 2025-05-07T20:26:04.0024847Z 2025-05-07T20:26:04.0026290Z 2025-05-07T20:26:04.0249211Z libnpp-12.3.1.54 | 93.4 MB | ####8 | 49%  2025-05-07T20:26:04.1024125Z nsight-compute-2024. | 443.1 MB | ######9 | 70% 2025-05-07T20:26:04.1024411Z 2025-05-07T20:26:04.1024416Z 2025-05-07T20:26:04.1024419Z 2025-05-07T20:26:04.1024423Z 2025-05-07T20:26:04.1024427Z 2025-05-07T20:26:04.1024432Z 2025-05-07T20:26:04.1024879Z 2025-05-07T20:26:04.1253247Z libnpp-12.3.1.54 | 93.4 MB | #####2 | 52%  2025-05-07T20:26:04.2027233Z nsight-compute-2024. | 443.1 MB | ####### | 70% 2025-05-07T20:26:04.2027544Z 2025-05-07T20:26:04.2027548Z 2025-05-07T20:26:04.2027552Z 2025-05-07T20:26:04.2027556Z 2025-05-07T20:26:04.2027560Z 2025-05-07T20:26:04.2027564Z 2025-05-07T20:26:04.2027606Z 2025-05-07T20:26:04.2255628Z libnpp-12.3.1.54 | 93.4 MB | #####6 | 56%  2025-05-07T20:26:04.3042748Z nsight-compute-2024. | 443.1 MB | #######1 | 71% 2025-05-07T20:26:04.3043038Z 2025-05-07T20:26:04.3043042Z 2025-05-07T20:26:04.3043045Z 2025-05-07T20:26:04.3043049Z 2025-05-07T20:26:04.3043053Z 2025-05-07T20:26:04.3043056Z 2025-05-07T20:26:04.3043060Z 2025-05-07T20:26:04.3259681Z libnpp-12.3.1.54 | 93.4 MB | ###### | 60%  2025-05-07T20:26:04.4052323Z nsight-compute-2024. | 443.1 MB | #######2 | 72% 2025-05-07T20:26:04.4052690Z 2025-05-07T20:26:04.4052696Z 2025-05-07T20:26:04.4052701Z 2025-05-07T20:26:04.4052715Z 2025-05-07T20:26:04.4052721Z 2025-05-07T20:26:04.4052726Z 2025-05-07T20:26:04.4052731Z 2025-05-07T20:26:04.4380203Z libnpp-12.3.1.54 | 93.4 MB | ######3 | 64%  2025-05-07T20:26:04.5053025Z nsight-compute-2024. | 443.1 MB | #######2 | 73% 2025-05-07T20:26:04.5053332Z 2025-05-07T20:26:04.5053352Z 2025-05-07T20:26:04.5053358Z 2025-05-07T20:26:04.5053363Z 2025-05-07T20:26:04.5053368Z 2025-05-07T20:26:04.5053373Z 2025-05-07T20:26:04.5057592Z 2025-05-07T20:26:04.5382776Z libnpp-12.3.1.54 | 93.4 MB | ######7 | 68%  2025-05-07T20:26:04.6123263Z nsight-compute-2024. 
| 443.1 MB | #######3 | 74% 2025-05-07T20:26:04.6123543Z 2025-05-07T20:26:04.6123548Z 2025-05-07T20:26:04.6123552Z 2025-05-07T20:26:04.6123555Z 2025-05-07T20:26:04.6123559Z 2025-05-07T20:26:04.6123563Z 2025-05-07T20:26:04.6124001Z 2025-05-07T20:26:04.6383257Z libnpp-12.3.1.54 | 93.4 MB | #######1 | 72%  2025-05-07T20:26:04.7127188Z nsight-compute-2024. | 443.1 MB | #######4 | 75% 2025-05-07T20:26:04.7127543Z 2025-05-07T20:26:04.7127548Z 2025-05-07T20:26:04.7127552Z 2025-05-07T20:26:04.7127555Z 2025-05-07T20:26:04.7127560Z 2025-05-07T20:26:04.7127564Z 2025-05-07T20:26:04.7129600Z 2025-05-07T20:26:04.7388476Z libnpp-12.3.1.54 | 93.4 MB | #######5 | 76%  2025-05-07T20:26:04.8167833Z nsight-compute-2024. | 443.1 MB | #######5 | 75% 2025-05-07T20:26:04.8168151Z 2025-05-07T20:26:04.8168157Z 2025-05-07T20:26:04.8168162Z 2025-05-07T20:26:04.8168167Z 2025-05-07T20:26:04.8168173Z 2025-05-07T20:26:04.8168178Z 2025-05-07T20:26:04.8170185Z 2025-05-07T20:26:04.8410531Z libnpp-12.3.1.54 | 93.4 MB | #######9 | 79%  2025-05-07T20:26:04.9170162Z nsight-compute-2024. | 443.1 MB | #######6 | 76% 2025-05-07T20:26:04.9170432Z 2025-05-07T20:26:04.9170436Z 2025-05-07T20:26:04.9170439Z 2025-05-07T20:26:04.9170443Z 2025-05-07T20:26:04.9170446Z 2025-05-07T20:26:04.9170452Z 2025-05-07T20:26:04.9171728Z 2025-05-07T20:26:04.9489755Z libnpp-12.3.1.54 | 93.4 MB | ########3 | 83%  2025-05-07T20:26:05.0172495Z nsight-compute-2024. | 443.1 MB | #######7 | 77% 2025-05-07T20:26:05.0172784Z 2025-05-07T20:26:05.0172790Z 2025-05-07T20:26:05.0173090Z 2025-05-07T20:26:05.0173262Z 2025-05-07T20:26:05.0173268Z 2025-05-07T20:26:05.0173273Z 2025-05-07T20:26:05.0173308Z 2025-05-07T20:26:05.0490717Z libnpp-12.3.1.54 | 93.4 MB | ########7 | 87%  2025-05-07T20:26:05.1173728Z nsight-compute-2024. | 443.1 MB | #######7 | 78% 2025-05-07T20:26:05.1174110Z 2025-05-07T20:26:05.1174116Z 2025-05-07T20:26:05.1174122Z 2025-05-07T20:26:05.1174127Z 2025-05-07T20:26:05.1174132Z 2025-05-07T20:26:05.1174137Z 2025-05-07T20:26:05.1176878Z 2025-05-07T20:26:05.1491889Z libnpp-12.3.1.54 | 93.4 MB | #########1 | 91%  2025-05-07T20:26:05.2179141Z nsight-compute-2024. | 443.1 MB | #######8 | 79% 2025-05-07T20:26:05.2179487Z 2025-05-07T20:26:05.2179494Z 2025-05-07T20:26:05.2179499Z 2025-05-07T20:26:05.2179506Z 2025-05-07T20:26:05.2179512Z 2025-05-07T20:26:05.2179518Z 2025-05-07T20:26:05.2179523Z 2025-05-07T20:26:05.2497685Z libnpp-12.3.1.54 | 93.4 MB | #########5 | 95%  2025-05-07T20:26:05.3181550Z nsight-compute-2024. | 443.1 MB | #######9 | 80% 2025-05-07T20:26:05.3181905Z 2025-05-07T20:26:05.3181912Z 2025-05-07T20:26:05.3181918Z 2025-05-07T20:26:05.3181923Z 2025-05-07T20:26:05.3181928Z 2025-05-07T20:26:05.3181933Z 2025-05-07T20:26:05.3181938Z 2025-05-07T20:26:05.3586616Z libnpp-12.3.1.54 | 93.4 MB | #########8 | 99%  2025-05-07T20:26:05.4586496Z nsight-compute-2024. | 443.1 MB | ######## | 80% 2025-05-07T20:26:05.5591359Z nsight-compute-2024. | 443.1 MB | ########1 | 81% 2025-05-07T20:26:05.6595668Z nsight-compute-2024. | 443.1 MB | ########2 | 82% 2025-05-07T20:26:05.7604154Z nsight-compute-2024. | 443.1 MB | ########3 | 83% 2025-05-07T20:26:05.8605511Z nsight-compute-2024. | 443.1 MB | ########4 | 84% 2025-05-07T20:26:05.9619093Z nsight-compute-2024. | 443.1 MB | ########5 | 85% 2025-05-07T20:26:06.0653240Z nsight-compute-2024. | 443.1 MB | ########6 | 86% 2025-05-07T20:26:06.1653595Z nsight-compute-2024. | 443.1 MB | ########7 | 87% 2025-05-07T20:26:06.2655200Z nsight-compute-2024. 
| 443.1 MB | ########7 | 88% 2025-05-07T20:26:06.4358831Z nsight-compute-2024. | 443.1 MB | ########8 | 89% 2025-05-07T20:26:06.4490516Z nsight-compute-2024. | 443.1 MB | ########9 | 90% 2025-05-07T20:26:06.4490871Z 2025-05-07T20:26:06.4490877Z 2025-05-07T20:26:06.4490882Z 2025-05-07T20:26:06.4490888Z 2025-05-07T20:26:06.4490893Z 2025-05-07T20:26:06.4495419Z 2025-05-07T20:26:06.5074203Z libcusolver-11.7.1.2 | 95.8 MB | ########## | 100%  2025-05-07T20:26:06.5074507Z 2025-05-07T20:26:06.5074511Z 2025-05-07T20:26:06.5074515Z 2025-05-07T20:26:06.5074520Z 2025-05-07T20:26:06.5074525Z 2025-05-07T20:26:06.5074529Z 2025-05-07T20:26:06.5074545Z 2025-05-07T20:26:06.5080796Z 2025-05-07T20:26:06.5441975Z cuda-nvdisasm-12.6.7 | 47.6 MB | | 0%  2025-05-07T20:26:06.6077453Z nsight-compute-2024. | 443.1 MB | ######### | 91% 2025-05-07T20:26:06.6077742Z 2025-05-07T20:26:06.6077788Z 2025-05-07T20:26:06.6077792Z 2025-05-07T20:26:06.6077796Z 2025-05-07T20:26:06.6077799Z 2025-05-07T20:26:06.6077803Z 2025-05-07T20:26:06.6077807Z 2025-05-07T20:26:06.6080467Z 2025-05-07T20:26:06.6446470Z cuda-nvdisasm-12.6.7 | 47.6 MB | 6 | 7%  2025-05-07T20:26:06.7282406Z nsight-compute-2024. | 443.1 MB | #########1 | 92% 2025-05-07T20:26:06.7282700Z 2025-05-07T20:26:06.7282704Z 2025-05-07T20:26:06.7282708Z 2025-05-07T20:26:06.7282711Z 2025-05-07T20:26:06.7282715Z 2025-05-07T20:26:06.7282720Z 2025-05-07T20:26:06.7282723Z 2025-05-07T20:26:06.7282727Z 2025-05-07T20:26:06.7631976Z cuda-nvdisasm-12.6.7 | 47.6 MB | #3 | 13%  2025-05-07T20:26:06.8290655Z nsight-compute-2024. | 443.1 MB | #########2 | 92% 2025-05-07T20:26:06.8290956Z 2025-05-07T20:26:06.8290961Z 2025-05-07T20:26:06.8290979Z 2025-05-07T20:26:06.8290984Z 2025-05-07T20:26:06.8290990Z 2025-05-07T20:26:06.8290995Z 2025-05-07T20:26:06.8291477Z 2025-05-07T20:26:06.8294862Z 2025-05-07T20:26:06.8760734Z cuda-nvdisasm-12.6.7 | 47.6 MB | #9 | 20%  2025-05-07T20:26:06.9011154Z nsight-compute-2024. | 443.1 MB | #########3 | 93% 2025-05-07T20:26:06.9011449Z 2025-05-07T20:26:06.9011455Z 2025-05-07T20:26:06.9011460Z 2025-05-07T20:26:06.9011465Z 2025-05-07T20:26:06.9013406Z 2025-05-07T20:26:06.9290916Z cuda-nvvp-12.6.80 | 109.3 MB | ########## | 100%  2025-05-07T20:26:06.9291274Z 2025-05-07T20:26:06.9291278Z 2025-05-07T20:26:06.9291281Z 2025-05-07T20:26:06.9291285Z 2025-05-07T20:26:06.9291289Z 2025-05-07T20:26:06.9291294Z 2025-05-07T20:26:06.9291298Z 2025-05-07T20:26:06.9291645Z 2025-05-07T20:26:06.9460604Z cuda-nvdisasm-12.6.7 | 47.6 MB | ##6 | 26%  2025-05-07T20:26:06.9460995Z 2025-05-07T20:26:06.9461000Z 2025-05-07T20:26:06.9461004Z 2025-05-07T20:26:06.9461017Z 2025-05-07T20:26:06.9461021Z 2025-05-07T20:26:06.9461024Z 2025-05-07T20:26:06.9461069Z 2025-05-07T20:26:06.9461074Z 2025-05-07T20:26:06.9463253Z 2025-05-07T20:26:06.9850099Z libcurand-10.3.7.77 | 39.9 MB | | 0%  2025-05-07T20:26:07.0406483Z nsight-compute-2024. 
| 443.1 MB | #########3 | 94% 2025-05-07T20:26:07.0406763Z 2025-05-07T20:26:07.0406768Z 2025-05-07T20:26:07.0406771Z 2025-05-07T20:26:07.0406775Z 2025-05-07T20:26:07.0406788Z 2025-05-07T20:26:07.0406792Z 2025-05-07T20:26:07.0406795Z 2025-05-07T20:26:07.0407712Z 2025-05-07T20:26:07.0465456Z cuda-nvdisasm-12.6.7 | 47.6 MB | ###2 | 32%  2025-05-07T20:26:07.0465908Z 2025-05-07T20:26:07.0465915Z 2025-05-07T20:26:07.0465920Z 2025-05-07T20:26:07.0465925Z 2025-05-07T20:26:07.0465931Z 2025-05-07T20:26:07.0465936Z 2025-05-07T20:26:07.0465941Z 2025-05-07T20:26:07.0465946Z 2025-05-07T20:26:07.0467649Z 2025-05-07T20:26:07.0971467Z libcurand-10.3.7.77 | 39.9 MB | 7 | 7%  2025-05-07T20:26:07.1412334Z nsight-compute-2024. | 443.1 MB | #########4 | 95% 2025-05-07T20:26:07.1412643Z 2025-05-07T20:26:07.1412647Z 2025-05-07T20:26:07.1412650Z 2025-05-07T20:26:07.1412654Z 2025-05-07T20:26:07.1412658Z 2025-05-07T20:26:07.1412661Z 2025-05-07T20:26:07.1412665Z 2025-05-07T20:26:07.1424272Z 2025-05-07T20:26:07.1491907Z cuda-nvdisasm-12.6.7 | 47.6 MB | ###8 | 39%  2025-05-07T20:26:07.1492236Z 2025-05-07T20:26:07.1492240Z 2025-05-07T20:26:07.1492244Z 2025-05-07T20:26:07.1492248Z 2025-05-07T20:26:07.1492252Z 2025-05-07T20:26:07.1492255Z 2025-05-07T20:26:07.1492259Z 2025-05-07T20:26:07.1492263Z 2025-05-07T20:26:07.1494360Z 2025-05-07T20:26:07.2203197Z libcurand-10.3.7.77 | 39.9 MB | #4 | 14%  2025-05-07T20:26:07.2483310Z nsight-compute-2024. | 443.1 MB | #########5 | 95% 2025-05-07T20:26:07.2483644Z 2025-05-07T20:26:07.2483649Z 2025-05-07T20:26:07.2483652Z 2025-05-07T20:26:07.2483656Z 2025-05-07T20:26:07.2483660Z 2025-05-07T20:26:07.2483706Z 2025-05-07T20:26:07.2483710Z 2025-05-07T20:26:07.2484996Z 2025-05-07T20:26:07.2498045Z cuda-nvdisasm-12.6.7 | 47.6 MB | ####4 | 45%  2025-05-07T20:26:07.2498666Z 2025-05-07T20:26:07.2498672Z 2025-05-07T20:26:07.2498678Z 2025-05-07T20:26:07.2498683Z 2025-05-07T20:26:07.2498689Z 2025-05-07T20:26:07.2498694Z 2025-05-07T20:26:07.2498699Z 2025-05-07T20:26:07.2498705Z 2025-05-07T20:26:07.2499967Z 2025-05-07T20:26:07.3361699Z libcurand-10.3.7.77 | 39.9 MB | ##1 | 22%  2025-05-07T20:26:07.3500084Z nsight-compute-2024. | 443.1 MB | #########5 | 96% 2025-05-07T20:26:07.3500390Z 2025-05-07T20:26:07.3500394Z 2025-05-07T20:26:07.3500397Z 2025-05-07T20:26:07.3500401Z 2025-05-07T20:26:07.3500405Z 2025-05-07T20:26:07.3500408Z 2025-05-07T20:26:07.3500412Z 2025-05-07T20:26:07.3500416Z 2025-05-07T20:26:07.3502287Z 2025-05-07T20:26:07.4254424Z libcurand-10.3.7.77 | 39.9 MB | ### | 30%  2025-05-07T20:26:07.4254935Z 2025-05-07T20:26:07.4254939Z 2025-05-07T20:26:07.4254943Z 2025-05-07T20:26:07.4254946Z 2025-05-07T20:26:07.4254950Z 2025-05-07T20:26:07.4254954Z 2025-05-07T20:26:07.4254957Z 2025-05-07T20:26:07.4256451Z 2025-05-07T20:26:07.4365983Z cuda-nvdisasm-12.6.7 | 47.6 MB | ##### | 51%  2025-05-07T20:26:07.4505817Z nsight-compute-2024. 
| 443.1 MB | #########6 | 97% 2025-05-07T20:26:07.4506095Z 2025-05-07T20:26:07.4506102Z 2025-05-07T20:26:07.4506108Z 2025-05-07T20:26:07.4506115Z 2025-05-07T20:26:07.4506119Z 2025-05-07T20:26:07.4506122Z 2025-05-07T20:26:07.4506126Z 2025-05-07T20:26:07.4506130Z 2025-05-07T20:26:07.4508434Z 2025-05-07T20:26:07.5259160Z libcurand-10.3.7.77 | 39.9 MB | ###8 | 38%  2025-05-07T20:26:07.5259466Z 2025-05-07T20:26:07.5259469Z 2025-05-07T20:26:07.5259473Z 2025-05-07T20:26:07.5259477Z 2025-05-07T20:26:07.5259481Z 2025-05-07T20:26:07.5259485Z 2025-05-07T20:26:07.5259516Z 2025-05-07T20:26:07.5261699Z 2025-05-07T20:26:07.5366061Z cuda-nvdisasm-12.6.7 | 47.6 MB | #####5 | 56%  2025-05-07T20:26:07.5602290Z nsight-compute-2024. | 443.1 MB | #########7 | 97% 2025-05-07T20:26:07.5602648Z 2025-05-07T20:26:07.5602653Z 2025-05-07T20:26:07.5602656Z 2025-05-07T20:26:07.5602661Z 2025-05-07T20:26:07.5602664Z 2025-05-07T20:26:07.5602668Z 2025-05-07T20:26:07.5602672Z 2025-05-07T20:26:07.5602676Z 2025-05-07T20:26:07.5608895Z 2025-05-07T20:26:07.6266433Z libcurand-10.3.7.77 | 39.9 MB | ####6 | 46%  2025-05-07T20:26:07.6266742Z 2025-05-07T20:26:07.6266745Z 2025-05-07T20:26:07.6266749Z 2025-05-07T20:26:07.6266752Z 2025-05-07T20:26:07.6266756Z 2025-05-07T20:26:07.6266760Z 2025-05-07T20:26:07.6266763Z 2025-05-07T20:26:07.6266767Z 2025-05-07T20:26:07.6480071Z cuda-nvdisasm-12.6.7 | 47.6 MB | ######1 | 61%  2025-05-07T20:26:07.6610489Z nsight-compute-2024. | 443.1 MB | #########8 | 98% 2025-05-07T20:26:07.6610851Z 2025-05-07T20:26:07.6610861Z 2025-05-07T20:26:07.6611022Z 2025-05-07T20:26:07.6611052Z 2025-05-07T20:26:07.6611058Z 2025-05-07T20:26:07.6611063Z 2025-05-07T20:26:07.6611068Z 2025-05-07T20:26:07.6611107Z 2025-05-07T20:26:07.6611204Z 2025-05-07T20:26:07.7270642Z libcurand-10.3.7.77 | 39.9 MB | #####4 | 54%  2025-05-07T20:26:07.7270948Z 2025-05-07T20:26:07.7270951Z 2025-05-07T20:26:07.7270955Z 2025-05-07T20:26:07.7270959Z 2025-05-07T20:26:07.7270962Z 2025-05-07T20:26:07.7270966Z 2025-05-07T20:26:07.7270979Z 2025-05-07T20:26:07.7271785Z 2025-05-07T20:26:07.7484865Z cuda-nvdisasm-12.6.7 | 47.6 MB | ######6 | 67%  2025-05-07T20:26:07.7686787Z nsight-compute-2024. | 443.1 MB | #########8 | 99% 2025-05-07T20:26:07.7687098Z 2025-05-07T20:26:07.7687290Z 2025-05-07T20:26:07.7687294Z 2025-05-07T20:26:07.7687298Z 2025-05-07T20:26:07.7687302Z 2025-05-07T20:26:07.7687305Z 2025-05-07T20:26:07.7687337Z 2025-05-07T20:26:07.7687369Z 2025-05-07T20:26:07.7687379Z 2025-05-07T20:26:07.8331370Z libcurand-10.3.7.77 | 39.9 MB | ######2 | 62%  2025-05-07T20:26:07.8331666Z 2025-05-07T20:26:07.8331670Z 2025-05-07T20:26:07.8331673Z 2025-05-07T20:26:07.8331689Z 2025-05-07T20:26:07.8331692Z 2025-05-07T20:26:07.8331697Z 2025-05-07T20:26:07.8331700Z 2025-05-07T20:26:07.8333090Z 2025-05-07T20:26:07.8678031Z cuda-nvdisasm-12.6.7 | 47.6 MB | #######1 | 72%  2025-05-07T20:26:07.8836105Z nsight-compute-2024. 
| 443.1 MB | #########9 | 99% 2025-05-07T20:26:07.8836398Z 2025-05-07T20:26:07.8836404Z 2025-05-07T20:26:07.8836409Z 2025-05-07T20:26:07.8836414Z 2025-05-07T20:26:07.8836419Z 2025-05-07T20:26:07.8836439Z 2025-05-07T20:26:07.8836444Z 2025-05-07T20:26:07.8836449Z 2025-05-07T20:26:07.8836632Z 2025-05-07T20:26:07.9345002Z libcurand-10.3.7.77 | 39.9 MB | ######9 | 70%  2025-05-07T20:26:07.9345298Z 2025-05-07T20:26:07.9345704Z 2025-05-07T20:26:07.9345709Z 2025-05-07T20:26:07.9345713Z 2025-05-07T20:26:07.9345728Z 2025-05-07T20:26:07.9345732Z 2025-05-07T20:26:07.9345735Z 2025-05-07T20:26:07.9345743Z 2025-05-07T20:26:07.9764058Z cuda-nvdisasm-12.6.7 | 47.6 MB | #######7 | 77%  2025-05-07T20:26:07.9841390Z nsight-compute-2024. | 443.1 MB | #########9 | 100% 2025-05-07T20:26:07.9841750Z 2025-05-07T20:26:07.9841756Z 2025-05-07T20:26:07.9841761Z 2025-05-07T20:26:07.9841766Z 2025-05-07T20:26:07.9841772Z 2025-05-07T20:26:07.9841778Z 2025-05-07T20:26:07.9841794Z 2025-05-07T20:26:07.9841799Z 2025-05-07T20:26:07.9841946Z 2025-05-07T20:26:08.0349294Z libcurand-10.3.7.77 | 39.9 MB | #######7 | 77%  2025-05-07T20:26:08.0349627Z 2025-05-07T20:26:08.0349640Z 2025-05-07T20:26:08.0349644Z 2025-05-07T20:26:08.0349648Z 2025-05-07T20:26:08.0349652Z 2025-05-07T20:26:08.0349658Z 2025-05-07T20:26:08.0349663Z 2025-05-07T20:26:08.0349667Z 2025-05-07T20:26:08.0881775Z cuda-nvdisasm-12.6.7 | 47.6 MB | ########3 | 83%  2025-05-07T20:26:08.0882125Z 2025-05-07T20:26:08.0882129Z 2025-05-07T20:26:08.0882133Z 2025-05-07T20:26:08.0882136Z 2025-05-07T20:26:08.0882140Z 2025-05-07T20:26:08.0882144Z 2025-05-07T20:26:08.0882147Z 2025-05-07T20:26:08.0882151Z 2025-05-07T20:26:08.0886527Z 2025-05-07T20:26:08.1350590Z libcurand-10.3.7.77 | 39.9 MB | ########4 | 84%  2025-05-07T20:26:08.1350932Z 2025-05-07T20:26:08.1350936Z 2025-05-07T20:26:08.1350940Z 2025-05-07T20:26:08.1350944Z 2025-05-07T20:26:08.1350947Z 2025-05-07T20:26:08.1350951Z 2025-05-07T20:26:08.1350954Z 2025-05-07T20:26:08.1352438Z 2025-05-07T20:26:08.1908127Z cuda-nvdisasm-12.6.7 | 47.6 MB | ######### | 90%  2025-05-07T20:26:08.1908469Z 2025-05-07T20:26:08.1908473Z 2025-05-07T20:26:08.1908477Z 2025-05-07T20:26:08.1908484Z 2025-05-07T20:26:08.1908489Z 2025-05-07T20:26:08.1908494Z 2025-05-07T20:26:08.1908527Z 2025-05-07T20:26:08.1908543Z 2025-05-07T20:26:08.1908547Z 2025-05-07T20:26:08.2356442Z libcurand-10.3.7.77 | 39.9 MB | #########1 | 92%  2025-05-07T20:26:08.2356780Z 2025-05-07T20:26:08.2356784Z 2025-05-07T20:26:08.2356788Z 2025-05-07T20:26:08.2356791Z 2025-05-07T20:26:08.2356795Z 2025-05-07T20:26:08.2356799Z 2025-05-07T20:26:08.2356811Z 2025-05-07T20:26:08.2356815Z 2025-05-07T20:26:08.2828347Z cuda-nvdisasm-12.6.7 | 47.6 MB | #########6 | 96%  2025-05-07T20:26:08.2828803Z 2025-05-07T20:26:08.2828809Z 2025-05-07T20:26:08.2828815Z 2025-05-07T20:26:08.2828829Z 2025-05-07T20:26:08.2828835Z 2025-05-07T20:26:08.2828840Z 2025-05-07T20:26:08.2831313Z 2025-05-07T20:26:08.3514866Z libnpp-12.3.1.54 | 93.4 MB | ########## | 100%  2025-05-07T20:26:08.3515302Z 2025-05-07T20:26:08.3515307Z 2025-05-07T20:26:08.3515310Z 2025-05-07T20:26:08.3515315Z 2025-05-07T20:26:08.3515320Z 2025-05-07T20:26:08.3515324Z 2025-05-07T20:26:08.3515371Z 2025-05-07T20:26:08.3515375Z 2025-05-07T20:26:08.3515378Z 2025-05-07T20:26:08.3517392Z 2025-05-07T20:26:08.4464994Z gds-tools-1.11.1.6 | 37.8 MB | | 0%  2025-05-07T20:26:08.4465305Z 2025-05-07T20:26:08.4465309Z 2025-05-07T20:26:08.4468200Z 2025-05-07T20:26:08.4514581Z libcusparse-12.5.4.2 | 118.6 MB | ########## | 100%  2025-05-07T20:26:08.4515064Z 
2025-05-07T20:26:09.5896451Z libcurand-10.3.7.77 | 39.9 MB | ########## | 100%
2025-05-07T20:26:09.9347202Z cuda-nvdisasm-12.6.7 | 47.6 MB | ########## | 100%
2025-05-07T20:26:10.3290220Z libcufft-11.3.0.4 | 156.2 MB | ########## | 100%
2025-05-07T20:26:10.8799437Z gds-tools-1.11.1.6 | 37.8 MB | ########## | 100%
2025-05-07T20:26:11.0806062Z libcublas-12.6.4.1 | 256.2 MB | ########## | 100%
2025-05-07T20:26:11.4581404Z cuda-nvcc-tools-12.6 | 23.0 MB | ########## | 100%
2025-05-07T20:26:11.5586257Z python-3.13.0 | 31.5 MB | ########## | 100%
2025-05-07T20:26:12.0097824Z cuda-nvrtc-12.6.85 | 17.3 MB | ########## | 100%
2025-05-07T20:26:12.2080079Z cuda-nvcc-dev_linux- | 10.8 MB | ########## | 100%
2025-05-07T20:26:12.2099306Z libnvjitlink-12.6.85 | 14.9 MB | ########## | 100%
2025-05-07T20:26:12.2620725Z cuda-nvvm-tools-12.6 | 10.4 MB | ########## | 100%
2025-05-07T20:26:12.5381737Z cuda-sanitizer-api-1 | 8.9 MB | ########## | 100%
2025-05-07T20:26:13.1529068Z cuda-nvvm-impl-12.6. | 7.7 MB | ########## | 100%
2025-05-07T20:26:14.1968817Z libcusolver-11.7.1.2 | 95.8 MB | ########## | 100%
2025-05-07T20:26:14.3518001Z cuda-nvvp-12.6.80 | 109.3 MB | ########## | 100%
2025-05-07T20:26:14.9478397Z libnpp-12.3.1.54 | 93.4 MB | ########## | 100%
2025-05-07T20:26:15.7963771Z nsight-compute-2024. | 443.1 MB | ########## | 100%
2025-05-07T20:26:17.3803692Z ... (more hidden) ...
2025-05-07T20:26:24.0497080Z 2025-05-07T20:26:24.0497083Z 2025-05-07T20:26:24.0497095Z 2025-05-07T20:26:24.0497099Z 2025-05-07T20:26:24.0497102Z 2025-05-07T20:26:24.0497211Z  2025-05-07T20:26:24.0497332Z 2025-05-07T20:26:24.0497336Z 2025-05-07T20:26:24.0497340Z 2025-05-07T20:26:24.0497343Z 2025-05-07T20:26:24.0497352Z 2025-05-07T20:26:24.0497355Z 2025-05-07T20:26:24.0497465Z  2025-05-07T20:26:24.0497590Z 2025-05-07T20:26:24.0497594Z 2025-05-07T20:26:24.0497597Z 2025-05-07T20:26:24.0497601Z 2025-05-07T20:26:24.0497604Z 2025-05-07T20:26:24.0497608Z 2025-05-07T20:26:24.0497618Z 2025-05-07T20:26:24.0497732Z  2025-05-07T20:26:24.0497871Z 2025-05-07T20:26:24.0497874Z 2025-05-07T20:26:24.0497878Z 2025-05-07T20:26:24.0498099Z 2025-05-07T20:26:24.0498103Z 2025-05-07T20:26:24.0498107Z 2025-05-07T20:26:24.0498110Z 2025-05-07T20:26:24.0498120Z 2025-05-07T20:26:24.0498659Z  2025-05-07T20:26:24.0498832Z 2025-05-07T20:26:24.0498835Z 2025-05-07T20:26:24.0498839Z 2025-05-07T20:26:24.0498842Z 2025-05-07T20:26:24.0498846Z 2025-05-07T20:26:24.0498849Z 2025-05-07T20:26:24.0498852Z 2025-05-07T20:26:24.0498856Z 2025-05-07T20:26:24.0498859Z 2025-05-07T20:26:24.0498988Z  2025-05-07T20:26:24.0499168Z 2025-05-07T20:26:24.0499172Z 2025-05-07T20:26:24.0499175Z 2025-05-07T20:26:24.0499179Z 2025-05-07T20:26:24.0499182Z 2025-05-07T20:26:24.0499185Z 2025-05-07T20:26:24.0499189Z 2025-05-07T20:26:24.0499192Z 2025-05-07T20:26:24.0499196Z 2025-05-07T20:26:24.0499199Z 2025-05-07T20:26:24.0499339Z  2025-05-07T20:26:24.0499519Z 2025-05-07T20:26:24.0499522Z 2025-05-07T20:26:24.0499526Z 2025-05-07T20:26:24.0499529Z 2025-05-07T20:26:24.0499540Z 2025-05-07T20:26:24.0499550Z 2025-05-07T20:26:24.0499553Z 2025-05-07T20:26:24.0499556Z 2025-05-07T20:26:24.0499560Z 2025-05-07T20:26:24.0499563Z 2025-05-07T20:26:24.0499567Z 2025-05-07T20:26:24.0499711Z  2025-05-07T20:26:24.0499900Z 2025-05-07T20:26:24.0499903Z 2025-05-07T20:26:24.0499907Z 2025-05-07T20:26:24.0499910Z 2025-05-07T20:26:24.0499913Z 2025-05-07T20:26:24.0499917Z 2025-05-07T20:26:24.0499920Z 2025-05-07T20:26:24.0499924Z 2025-05-07T20:26:24.0499927Z 2025-05-07T20:26:24.0499930Z 2025-05-07T20:26:24.0499942Z 2025-05-07T20:26:24.0499945Z 2025-05-07T20:26:24.0500081Z  2025-05-07T20:26:24.0500280Z 2025-05-07T20:26:24.0500284Z 2025-05-07T20:26:24.0500287Z 2025-05-07T20:26:24.0500290Z 2025-05-07T20:26:24.0500294Z 2025-05-07T20:26:24.0500304Z 2025-05-07T20:26:24.0500307Z 2025-05-07T20:26:24.0500311Z 2025-05-07T20:26:24.0500314Z 2025-05-07T20:26:24.0500318Z 2025-05-07T20:26:24.0500321Z 2025-05-07T20:26:24.0500328Z 2025-05-07T20:26:24.0500337Z 2025-05-07T20:26:24.0500477Z  2025-05-07T20:26:24.0500690Z 2025-05-07T20:26:24.0500693Z 2025-05-07T20:26:24.0500696Z 2025-05-07T20:26:24.0500700Z 2025-05-07T20:26:24.0500703Z 2025-05-07T20:26:24.0500707Z 2025-05-07T20:26:24.0500710Z 2025-05-07T20:26:24.0500713Z 2025-05-07T20:26:24.0500717Z 2025-05-07T20:26:24.0500720Z 2025-05-07T20:26:24.0500723Z 2025-05-07T20:26:24.0500727Z 2025-05-07T20:26:24.0500730Z 2025-05-07T20:26:24.0500734Z 2025-05-07T20:26:24.0500880Z  2025-05-07T20:26:24.0501101Z 2025-05-07T20:26:24.0501105Z 2025-05-07T20:26:24.0501108Z 2025-05-07T20:26:24.0501112Z 2025-05-07T20:26:24.0501115Z 2025-05-07T20:26:24.0501119Z 2025-05-07T20:26:24.0501122Z 2025-05-07T20:26:24.0501125Z 2025-05-07T20:26:24.0501129Z 2025-05-07T20:26:24.0501132Z 2025-05-07T20:26:24.0501136Z 2025-05-07T20:26:24.0501139Z 2025-05-07T20:26:24.0501143Z 2025-05-07T20:26:24.0501146Z 2025-05-07T20:26:24.0501160Z 2025-05-07T20:26:24.0501345Z  2025-05-07T20:26:24.0501585Z 
2025-05-07T20:26:24.0501589Z 2025-05-07T20:26:24.0501592Z 2025-05-07T20:26:24.0501596Z 2025-05-07T20:26:24.0501599Z 2025-05-07T20:26:24.0501603Z 2025-05-07T20:26:24.0501606Z 2025-05-07T20:26:24.0501609Z 2025-05-07T20:26:24.0501613Z 2025-05-07T20:26:24.0501623Z 2025-05-07T20:26:24.0501627Z 2025-05-07T20:26:24.0501630Z 2025-05-07T20:26:24.0501634Z 2025-05-07T20:26:24.0501637Z 2025-05-07T20:26:24.0501641Z 2025-05-07T20:26:24.0501644Z 2025-05-07T20:26:24.0501804Z  2025-05-07T20:26:24.0502035Z 2025-05-07T20:26:24.0502038Z 2025-05-07T20:26:24.0502042Z 2025-05-07T20:26:24.0502046Z 2025-05-07T20:26:24.0502049Z 2025-05-07T20:26:24.0502053Z 2025-05-07T20:26:24.0502056Z 2025-05-07T20:26:24.0502060Z 2025-05-07T20:26:24.0502063Z 2025-05-07T20:26:24.0502067Z 2025-05-07T20:26:24.0502070Z 2025-05-07T20:26:24.0502074Z 2025-05-07T20:26:24.0502335Z 2025-05-07T20:26:24.0502340Z 2025-05-07T20:26:24.0502343Z 2025-05-07T20:26:24.0502347Z 2025-05-07T20:26:24.0502351Z 2025-05-07T20:26:24.0502520Z  2025-05-07T20:26:24.0502723Z 2025-05-07T20:26:24.0502727Z 2025-05-07T20:26:24.0502731Z 2025-05-07T20:26:24.0502734Z 2025-05-07T20:26:24.0502738Z 2025-05-07T20:26:24.0502741Z 2025-05-07T20:26:24.0502745Z 2025-05-07T20:26:24.0502748Z 2025-05-07T20:26:24.0502752Z 2025-05-07T20:26:24.0502756Z 2025-05-07T20:26:24.0502759Z 2025-05-07T20:26:24.0502763Z 2025-05-07T20:26:24.0502766Z 2025-05-07T20:26:24.0502770Z 2025-05-07T20:26:24.0502781Z 2025-05-07T20:26:24.0502785Z 2025-05-07T20:26:24.0502788Z 2025-05-07T20:26:24.0502792Z 2025-05-07T20:26:24.0502953Z  2025-05-07T20:26:24.0503155Z 2025-05-07T20:26:24.0503159Z 2025-05-07T20:26:24.0503266Z  2025-05-07T20:26:24.0503368Z 2025-05-07T20:26:24.0503372Z 2025-05-07T20:26:24.0503486Z  2025-05-07T20:26:24.0503596Z 2025-05-07T20:26:24.0503600Z 2025-05-07T20:26:24.0503603Z 2025-05-07T20:26:24.0503704Z  2025-05-07T20:26:24.0503818Z 2025-05-07T20:26:24.0503821Z 2025-05-07T20:26:24.0503825Z 2025-05-07T20:26:24.0503829Z 2025-05-07T20:26:24.0503934Z  2025-05-07T20:26:24.0504055Z 2025-05-07T20:26:24.0504058Z 2025-05-07T20:26:24.0504062Z 2025-05-07T20:26:24.0504066Z 2025-05-07T20:26:24.0504069Z 2025-05-07T20:26:24.0504178Z  2025-05-07T20:26:24.0504297Z 2025-05-07T20:26:24.0504301Z 2025-05-07T20:26:24.0504310Z 2025-05-07T20:26:24.0504314Z 2025-05-07T20:26:24.0504318Z 2025-05-07T20:26:24.0504321Z 2025-05-07T20:26:24.0504434Z  2025-05-07T20:26:24.0504559Z 2025-05-07T20:26:24.0504562Z 2025-05-07T20:26:24.0504566Z 2025-05-07T20:26:24.0504570Z 2025-05-07T20:26:24.0504580Z 2025-05-07T20:26:24.0504583Z 2025-05-07T20:26:24.0504587Z 2025-05-07T20:26:24.0504699Z  2025-05-07T20:26:24.0504834Z 2025-05-07T20:26:24.0504848Z 2025-05-07T20:26:24.0504851Z 2025-05-07T20:26:24.0504855Z 2025-05-07T20:26:24.0504858Z 2025-05-07T20:26:24.0504868Z 2025-05-07T20:26:24.0504872Z 2025-05-07T20:26:24.0504875Z 2025-05-07T20:26:24.0504993Z  2025-05-07T20:26:24.0505137Z 2025-05-07T20:26:24.0505141Z 2025-05-07T20:26:24.0505144Z 2025-05-07T20:26:24.0505148Z 2025-05-07T20:26:24.0505162Z 2025-05-07T20:26:24.0505165Z 2025-05-07T20:26:24.0505169Z 2025-05-07T20:26:24.0505173Z 2025-05-07T20:26:24.0505176Z 2025-05-07T20:26:24.0505300Z  2025-05-07T20:26:24.0505449Z 2025-05-07T20:26:24.0505452Z 2025-05-07T20:26:24.0505456Z 2025-05-07T20:26:24.0505466Z 2025-05-07T20:26:24.0505470Z 2025-05-07T20:26:24.0505473Z 2025-05-07T20:26:24.0505477Z 2025-05-07T20:26:24.0505481Z 2025-05-07T20:26:24.0505484Z 2025-05-07T20:26:24.0505488Z 2025-05-07T20:26:24.0505613Z  2025-05-07T20:26:24.0505770Z 2025-05-07T20:26:24.0505780Z 2025-05-07T20:26:24.0505792Z 
2025-05-07T20:26:24.3553116Z Preparing transaction: done
2025-05-07T20:26:30.8871528Z Verifying transaction: done
2025-05-07T20:26:31.8070720Z Executing transaction: done
2025-05-07T20:26:34.4210787Z [INSTALL] Fixing file placements for CUDA 12.6.3+ ...
2025-05-07T20:26:34.4211186Z [INSTALL] Creating symlinks: libnvToolsExt.so
2025-05-07T20:26:34.4212247Z + ln -sf /home/ec2-user/miniconda/envs/build_binary/lib/libnvToolsExt.so.1 /home/ec2-user/miniconda/envs/build_binary/lib/libnvToolsExt.so
2025-05-07T20:26:34.4226431Z + ln -sf /home/ec2-user/miniconda/envs/build_binary/targets/x86_64-linux/lib/libnvToolsExt.so.1 /home/ec2-user/miniconda/envs/build_binary/targets/x86_64-linux/lib/libnvToolsExt.so
2025-05-07T20:26:34.4239067Z [INSTALL] Copying nvtx3 headers ...
2025-05-07T20:26:34.4243952Z + cp -r /home/ec2-user/miniconda/envs/build_binary/nsight-compute-2024.3.2/host/target-linux-x64/nvtx/include/nvtx3/nvToolsExt.h /home/ec2-user/miniconda/envs/build_binary/nsight-compute-2024.3.2/host/target-linux-x64/nvtx/include/nvtx3/nvToolsExtCuda.h /home/ec2-user/miniconda/envs/build_binary/nsight-compute-2024.3.2/host/target-linux-x64/nvtx/include/nvtx3/nvToolsExtCudaRt.h /home/ec2-user/miniconda/envs/build_binary/nsight-compute-2024.3.2/host/target-linux-x64/nvtx/include/nvtx3/nvToolsExtOpenCL.h /home/ec2-user/miniconda/envs/build_binary/nsight-compute-2024.3.2/host/target-linux-x64/nvtx/include/nvtx3/nvToolsExtSync.h /home/ec2-user/miniconda/envs/build_binary/nsight-compute-2024.3.2/host/target-linux-x64/nvtx/include/nvtx3/nvtx3.hpp /home/ec2-user/miniconda/envs/build_binary/nsight-compute-2024.3.2/host/target-linux-x64/nvtx/include/nvtx3/nvtxDetail /home/ec2-user/miniconda/envs/build_binary/include/
2025-05-07T20:26:34.5828655Z + cp -r /home/ec2-user/miniconda/envs/build_binary/nsight-compute-2024.3.2/host/target-linux-x64/nvtx/include/nvtx3/nvToolsExt.h /home/ec2-user/miniconda/envs/build_binary/nsight-compute-2024.3.2/host/target-linux-x64/nvtx/include/nvtx3/nvToolsExtCuda.h /home/ec2-user/miniconda/envs/build_binary/nsight-compute-2024.3.2/host/target-linux-x64/nvtx/include/nvtx3/nvToolsExtCudaRt.h /home/ec2-user/miniconda/envs/build_binary/nsight-compute-2024.3.2/host/target-linux-x64/nvtx/include/nvtx3/nvToolsExtOpenCL.h /home/ec2-user/miniconda/envs/build_binary/nsight-compute-2024.3.2/host/target-linux-x64/nvtx/include/nvtx3/nvToolsExtSync.h /home/ec2-user/miniconda/envs/build_binary/nsight-compute-2024.3.2/host/target-linux-x64/nvtx/include/nvtx3/nvtx3.hpp /home/ec2-user/miniconda/envs/build_binary/nsight-compute-2024.3.2/host/target-linux-x64/nvtx/include/nvtx3/nvtxDetail /home/ec2-user/miniconda/envs/build_binary/targets/x86_64-linux/include/
2025-05-07T20:26:34.5851598Z [INSTALL] Appending libcuda.so path to LD_LIBRARY_PATH ...
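The stubs directory appended in the commands that follow holds link-time driver stubs (libcuda.so, libnvidia-ml.so) shipped with the CUDA toolkit, so the build can link against the driver API without a real driver installed. A minimal sketch of the mechanism, assuming the env name build_binary used throughout this job; `conda env config vars set` persists the variable inside the env so every later activation or `conda run` sees it:

  # persist the stub path into the env (takes effect on the next activation)
  conda env config vars set -n build_binary \
      LD_LIBRARY_PATH="/home/ec2-user/miniconda/envs/build_binary/targets/x86_64-linux/lib/stubs"
  # confirm the variable is now baked into the env
  conda run -n build_binary printenv LD_LIBRARY_PATH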
2025-05-07T20:26:34.6226517Z [ENV] Appending to LD_LIBRARY_PATH: /home/ec2-user/miniconda/envs/build_binary/targets/x86_64-linux/lib/stubs ...
2025-05-07T20:26:36.5234066Z ERROR conda.cli.main_run:execute(125): `conda run printenv LD_LIBRARY_PATH` failed. (See above for error)
2025-05-07T20:26:36.5888611Z + conda env config vars set -n build_binary LD_LIBRARY_PATH=/home/ec2-user/miniconda/envs/build_binary/targets/x86_64-linux/lib/stubs
2025-05-07T20:26:37.0171148Z [INSTALL] Setting environment variable NVML_LIB_PATH ...
2025-05-07T20:26:37.0522798Z + conda env config vars set -n build_binary NVML_LIB_PATH=/home/ec2-user/miniconda/envs/build_binary/lib/stubs/libnvidia-ml.so
2025-05-07T20:26:37.4926030Z [INSTALL] Setting environment variable CUDA_INCLUDE_DIRS ...
2025-05-07T20:26:37.4927329Z + conda env config vars set -n build_binary CUDA_INCLUDE_DIRS="/home/ec2-user/miniconda/envs/build_binary/include/:/home/ec2-user/miniconda/envs/build_binary/targets/x86_64-linux/include/"
2025-05-07T20:26:39.9670449Z [CHECK] cuda_runtime.h found in CONDA_PREFIX PATH (file): /home/ec2-user/miniconda/envs/build_binary/targets/x86_64-linux/include/cuda_runtime.h
2025-05-07T20:26:42.0022576Z [CHECK] libcuda.so found in CONDA_PREFIX PATH (file): /home/ec2-user/miniconda/envs/build_binary/targets/x86_64-linux/lib/stubs/libcuda.so
2025-05-07T20:26:44.0453751Z [CHECK] libnvToolsExt.so found in CONDA_PREFIX PATH (symbolic link): /home/ec2-user/miniconda/envs/build_binary/lib/libnvToolsExt.so
2025-05-07T20:26:44.0454548Z /home/ec2-user/miniconda/envs/build_binary/targets/x86_64-linux/lib/libnvToolsExt.so
2025-05-07T20:26:46.0840753Z [CHECK] libnvidia-ml.so found in CONDA_PREFIX PATH (file): /home/ec2-user/miniconda/envs/build_binary/targets/x86_64-linux/lib/stubs/libnvidia-ml.so
2025-05-07T20:26:48.0003345Z /home/ec2-user/miniconda/envs/build_binary/bin/nvcc
2025-05-07T20:26:48.0642407Z [CHECK] Binary nvcc found in PATH
2025-05-07T20:26:51.9629159Z /tmp/tmp6ugsjzsc: line 3: clang: command not found
2025-05-07T20:26:51.9630378Z ERROR conda.cli.main_run:execute(125): `conda run clang --version` failed. (See above for error)
2025-05-07T20:26:52.0328275Z + ls -la /home/ec2-user/miniconda/envs/build_binary/etc/conda/activate.d
2025-05-07T20:26:52.0350880Z total 36
2025-05-07T20:26:52.0351171Z drwxr-xr-x. 2 ec2-user ec2-user 191 May 7 20:26 .
2025-05-07T20:26:52.0351599Z drwxr-xr-x. 5 ec2-user ec2-user 62 May 7 20:25 ..
2025-05-07T20:26:52.0352040Z -rw-r--r--. 2 ec2-user ec2-user 3778 Jun 10 2024 activate-binutils_linux-64.sh
2025-05-07T20:26:52.0352538Z -rw-r--r--. 2 ec2-user ec2-user 11630 Jun 10 2024 activate-gcc_linux-64.sh
2025-05-07T20:26:52.0353007Z -rw-r--r--. 2 ec2-user ec2-user 5190 Jun 10 2024 activate-gxx_linux-64.sh
2025-05-07T20:26:52.0353457Z -rw-r--r--. 2 ec2-user ec2-user 136 Mar 27 01:27 libglib_activate.sh
2025-05-07T20:26:52.0353890Z -rw-r--r--. 2 ec2-user ec2-user 872 Nov 13 09:20 libxml2_activate.sh
2025-05-07T20:26:52.0354331Z -rw-r--r--. 2 ec2-user ec2-user 2932 Nov 20 20:32 ~cuda-nvcc_activate.sh
2025-05-07T20:26:52.0354820Z [INSTALL] Removing the -ccbin=CXX hook from NVCC activation scripts ...
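The ~cuda-nvcc_activate.sh listed above is the hook in question: it pins nvcc to the conda-packaged host compiler via a -ccbin= flag, and the sed in the next command deletes that line so nvcc resolves the host c++ from PATH instead. A sketch of the kind of line being removed (illustrative contents, not the actual script shipped with the cuda-nvcc package):

  # hypothetical excerpt of ~cuda-nvcc_activate.sh; only the -ccbin= line matters here
  export NVCC_PREPEND_FLAGS="${NVCC_PREPEND_FLAGS:-} -ccbin=${CXX}"

With the pin gone, the job instead sets NVCC_PREPEND_FLAGS to -allow-unsupported-compiler below, which tells nvcc to proceed even when the host compiler version falls outside the toolkit's officially supported range.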
2025-05-07T20:26:52.0355442Z + sed -i /-ccbin=/d /home/ec2-user/miniconda/envs/build_binary/etc/conda/activate.d/*cuda-nvcc_activate.sh 2025-05-07T20:26:52.0355861Z 2025-05-07T20:26:52.0374483Z 2025-05-07T20:26:52.0375057Z + conda run -n build_binary c++ --version | grep -i clang 2025-05-07T20:26:52.0375336Z 2025-05-07T20:26:54.0335900Z 2025-05-07T20:26:54.0336692Z [BUILD] Setting prepend flags for NVCC ... 2025-05-07T20:26:54.0337393Z + conda env config vars set -n build_binary NVCC_PREPEND_FLAGS="-allow-unsupported-compiler" 2025-05-07T20:26:54.0337765Z 2025-05-07T20:26:54.4759990Z 2025-05-07T20:26:54.4760453Z + conda run -n build_binary printenv NVCC_PREPEND_FLAGS 2025-05-07T20:26:54.4760715Z 2025-05-07T20:26:56.3758971Z -allow-unsupported-compiler 2025-05-07T20:26:56.3759188Z 2025-05-07T20:26:56.4405877Z 2025-05-07T20:26:56.4406543Z [INFO] Printing out all preprocessor defines in nvcc ... 2025-05-07T20:26:56.4407221Z + conda run -n build_binary nvcc --compiler-options -dM -E -x cu - < /dev/null 2025-05-07T20:26:58.4135422Z 2025-05-07T20:26:58.4136160Z #define _GLIBCXX_DEPRECATED_SUGGEST(ALT) __attribute__ ((__deprecated__ ("use '" ALT "' instead"))) 2025-05-07T20:26:58.4136894Z #define M_PIl 3.141592653589793238462643383279502884L 2025-05-07T20:26:58.4137239Z #define _IO_CURRENTLY_PUTTING 0x800 2025-05-07T20:26:58.4137560Z #define __W_EXITCODE(ret,sig) ((ret) << 8 | (sig)) 2025-05-07T20:26:58.4137881Z #define __DBL_MIN_EXP__ (-1021) 2025-05-07T20:26:58.4138166Z #define _STL_PAIR_H 1 2025-05-07T20:26:58.4138467Z #define __cpp_attributes 200809L 2025-05-07T20:26:58.4138935Z #define __cpp_nontype_template_parameter_auto 201606L 2025-05-07T20:26:58.4139383Z #define __DELETE_THROW throw() 2025-05-07T20:26:58.4139640Z #define _PTRDIFF_T_ 2025-05-07T20:26:58.4139878Z #define M_PI_4 0.78539816339744830962 2025-05-07T20:26:58.4140248Z #define __UINT_LEAST16_MAX__ 0xffff 2025-05-07T20:26:58.4140597Z #define _IO_LEFT 02 2025-05-07T20:26:58.4140850Z #define __ATOMIC_ACQUIRE 2 2025-05-07T20:26:58.4141196Z #define _POSIX2_BC_SCALE_MAX 99 2025-05-07T20:26:58.4141934Z #define _GLIBCXX_USE_RANDOM_TR1 1 2025-05-07T20:26:58.4144180Z #define _GLIBCXX_MOVE_BACKWARD3(_Tp,_Up,_Vp) std::move_backward(_Tp, _Up, _Vp) 2025-05-07T20:26:58.4144596Z #define __FLT128_MAX_10_EXP__ 4932 2025-05-07T20:26:58.4144868Z #define RE_DUP_MAX (0x7fff) 2025-05-07T20:26:58.4145113Z #define _IOS_OUTPUT 2 2025-05-07T20:26:58.4145470Z #define __FLT_MIN__ 1.17549435082228750796873653722224568e-38F 2025-05-07T20:26:58.4145975Z #define toascii_l(c,l) __toascii_l ((c), (l)) 2025-05-07T20:26:58.4146386Z #define __GCC_IEC_559_COMPLEX 2 2025-05-07T20:26:58.4146704Z #define _GLIBCXX_USE_FCHMOD 1 2025-05-07T20:26:58.4146970Z #define __cpp_aggregate_nsdmi 201304L 2025-05-07T20:26:58.4147717Z #define __bswap_16(x) (__extension__ ({ unsigned short int __v, __x = (unsigned short int) (x); if (__builtin_constant_p (__x)) __v = __bswap_constant_16 (__x); else __asm__ ("rorw $8, %w0" : "=r" (__v) : "0" (__x) : "cc"); __v; })) 2025-05-07T20:26:58.4148665Z #define __UINT_LEAST8_TYPE__ unsigned char 2025-05-07T20:26:58.4149216Z #define __SIZEOF_FLOAT80__ 16 2025-05-07T20:26:58.4149511Z #define cudaTextureTypeCubemapLayered 0xFC 2025-05-07T20:26:58.4149812Z #define _T_WCHAR_ 2025-05-07T20:26:58.4150031Z #define stdout stdout 2025-05-07T20:26:58.4150347Z #define _GLIBCXX_ABI_TAG_CXX11 __attribute ((__abi_tag__ ("cxx11"))) 2025-05-07T20:26:58.4150720Z #define CHAR_BIT __CHAR_BIT__ 2025-05-07T20:26:58.4150967Z #define __flexarr [] 
2025-05-07T20:26:58.4151192Z #define _GLIBCXX_HAVE_FINITEF 1 2025-05-07T20:26:58.4151503Z #define __islower_l(c,l) __isctype_l((c), _ISlower, (l)) 2025-05-07T20:26:58.4151840Z #define _IO_FLAGS2_USER_WBUF 8 2025-05-07T20:26:58.4152079Z #define _MATH_H 1 2025-05-07T20:26:58.4152350Z #define cudaOccupancyDisableCachingOverride 0x01 2025-05-07T20:26:58.4152683Z #define __S64_TYPE long int 2025-05-07T20:26:58.4152923Z #define __stub_fchflags 2025-05-07T20:26:58.4153185Z #define cudaDeviceScheduleMask 0x07 2025-05-07T20:26:58.4153468Z #define __SQUAD_TYPE long int 2025-05-07T20:26:58.4153727Z #define __INTMAX_C(c) c ## L 2025-05-07T20:26:58.4153987Z #define _BSD_SIZE_T_DEFINED_ 2025-05-07T20:26:58.4154240Z #define NL_NMAX INT_MAX 2025-05-07T20:26:58.4154474Z #define _BITS_TIME_H 1 2025-05-07T20:26:58.4154734Z #define M_LN10l 2.302585092994045684017991454684364208L 2025-05-07T20:26:58.4155056Z #define _GLIBCXX_TXN_SAFE_DYN 2025-05-07T20:26:58.4155353Z #define cudaStreamTailLaunch ((cudaStream_t)0x3) 2025-05-07T20:26:58.4155692Z #define M_El 2.718281828459045235360287471352662498L 2025-05-07T20:26:58.4156082Z #define _PSTL_PRAGMA_DECLARE_SIMD _PSTL_PRAGMA(omp declare simd) 2025-05-07T20:26:58.4156441Z #define __CHAR_BIT__ 8 2025-05-07T20:26:58.4156690Z #define __FSWORD_T_TYPE __SYSCALL_SLONG_TYPE 2025-05-07T20:26:58.4157000Z #define _PSTL_STRING_CONCAT(x,y) x #y 2025-05-07T20:26:58.4157288Z #define _GLIBCXX98_USE_C99_MATH 1 2025-05-07T20:26:58.4157545Z #define FP_NAN 0 2025-05-07T20:26:58.4157804Z #define makedev(maj,min) gnu_dev_makedev (maj, min) 2025-05-07T20:26:58.4158238Z #define __glibcxx_requires_sorted_set_pred(_First1,_Last1,_First2,_Pred) 2025-05-07T20:26:58.4158723Z #define cudaGetDeviceProperties cudaGetDeviceProperties_v2 2025-05-07T20:26:58.4159098Z #define __cudaCDP2GetErrorString 2025-05-07T20:26:58.4159381Z #define SHRT_MAX __SHRT_MAX__ 2025-05-07T20:26:58.4159638Z #define _GLIBCXX_X86_RDSEED 1 2025-05-07T20:26:58.4159883Z #define __SM_80_RT_H__ 2025-05-07T20:26:58.4160107Z #define _NEW 2025-05-07T20:26:58.4160326Z #define CLOCK_PROCESS_CPUTIME_ID 2 2025-05-07T20:26:58.4160593Z #define __UINT8_MAX__ 0xff 2025-05-07T20:26:58.4160949Z #define _PSTL_ASSERT_MSG(_Condition,_Message) __glibcxx_assert(_Condition) 2025-05-07T20:26:58.4161355Z #define __SM_35_ATOMIC_FUNCTIONS_H__ 2025-05-07T20:26:58.4161625Z #define __SCHAR_WIDTH__ 8 2025-05-07T20:26:58.4161860Z #define __USE_ANSI 1 2025-05-07T20:26:58.4162137Z #define _IO_BE(expr,res) __builtin_expect ((expr), res) 2025-05-07T20:26:58.4162518Z #define __isupper_l(c,l) __isctype_l((c), _ISupper, (l)) 2025-05-07T20:26:58.4162976Z #define __cudaCDP2Memcpy2DAsync_ptsz 2025-05-07T20:26:58.4163352Z #define __WINT_MAX__ 0xffffffffU 2025-05-07T20:26:58.4163624Z #define __SIZEOF_PTHREAD_ATTR_T 56 2025-05-07T20:26:58.4163892Z #define __FLT32_MIN_EXP__ (-125) 2025-05-07T20:26:58.4164163Z #define _GLIBCXX_END_NAMESPACE_LDBL 2025-05-07T20:26:58.4164439Z #define PIPE_BUF 4096 2025-05-07T20:26:58.4164743Z #define _PSTL_PRAGMA_SIMD_ORDERED_MONOTONIC_2ARGS(PRM1,PRM2) 2025-05-07T20:26:58.4165093Z #define ADJ_TICK 0x4000 2025-05-07T20:26:58.4165362Z #define _PSTL_VERSION_PATCH (_PSTL_VERSION % 10) 2025-05-07T20:26:58.4165666Z #define MQ_PRIO_MAX 32768 2025-05-07T20:26:58.4165922Z #define __SIZEOF_PTHREAD_MUTEXATTR_T 4 2025-05-07T20:26:58.4166234Z #define __WAIT_INT(status) (*(int *) &(status)) 2025-05-07T20:26:58.4166687Z #define __GLIBC_PREREQ(maj,min) ((__GLIBC__ << 16) + __GLIBC_MINOR__ >= ((maj) << 16) + (min)) 2025-05-07T20:26:58.4167195Z #define 
cudaCooperativeLaunchMultiDeviceNoPreSync 0x01 2025-05-07T20:26:58.4167555Z #define _XOPEN_SOURCE 700 2025-05-07T20:26:58.4167820Z #define _POSIX2_BC_DIM_MAX 2048 2025-05-07T20:26:58.4168083Z #define __VECTOR_FUNCTIONS_HPP__ 2025-05-07T20:26:58.4168363Z #define __cpp_static_assert 201411L 2025-05-07T20:26:58.4168690Z #define __WEXITSTATUS(status) (((status) & 0xff00) >> 8) 2025-05-07T20:26:58.4169021Z #define _GLIBCXX_HAVE_STRXFRM_L 1 2025-05-07T20:26:58.4169300Z #define _POSIX_TTY_NAME_MAX 9 2025-05-07T20:26:58.4169577Z #define _GLIBCXX_USE_WEAK_REF __GXX_WEAK__ 2025-05-07T20:26:58.4169871Z #define __OFF_T_MATCHES_OFF64_T 1 2025-05-07T20:26:58.4170144Z #define __ORDER_LITTLE_ENDIAN__ 1234 2025-05-07T20:26:58.4170439Z #define __SIZE_MAX__ 0xffffffffffffffffUL 2025-05-07T20:26:58.4170791Z #define __ispunct_l(c,l) __isctype_l((c), _ISpunct, (l)) 2025-05-07T20:26:58.4171121Z #define __WCHAR_MAX__ 0x7fffffff 2025-05-07T20:26:58.4171397Z #define _GLIBCXX_USE_CLOCK_MONOTONIC 1 2025-05-07T20:26:58.4171705Z #define __BLKCNT_T_TYPE __SYSCALL_SLONG_TYPE 2025-05-07T20:26:58.4172055Z #define __isprint_l(c,l) __isctype_l((c), _ISprint, (l)) 2025-05-07T20:26:58.4172409Z #define cudaNvSciSyncAttrSignal 0x1 2025-05-07T20:26:58.4172703Z #define _GLIBCXX_USE_LONG_LONG 1 2025-05-07T20:26:58.4172986Z #define __GCC_HAVE_SYNC_COMPARE_AND_SWAP_1 1 2025-05-07T20:26:58.4173307Z #define __GCC_HAVE_SYNC_COMPARE_AND_SWAP_2 1 2025-05-07T20:26:58.4173822Z #define __GCC_HAVE_SYNC_COMPARE_AND_SWAP_4 1 2025-05-07T20:26:58.4174263Z #define __DBL_DENORM_MIN__ double(4.94065645841246544176568792868221372e-324L) 2025-05-07T20:26:58.4174665Z #define __GCC_HAVE_SYNC_COMPARE_AND_SWAP_8 1 2025-05-07T20:26:58.4174967Z #define ADJ_ESTERROR 0x0008 2025-05-07T20:26:58.4175231Z #define __GCC_ATOMIC_CHAR_LOCK_FREE 2 2025-05-07T20:26:58.4175506Z #define __GCC_IEC_559 2 2025-05-07T20:26:58.4175790Z #define __cpp_lib_transformation_trait_aliases 201304 2025-05-07T20:26:58.4176124Z #define _IO_flockfile(_fp) 2025-05-07T20:26:58.4176374Z #define CLOCK_MONOTONIC_RAW 4 2025-05-07T20:26:58.4176636Z #define __FLT32X_DECIMAL_DIG__ 17 2025-05-07T20:26:58.4176893Z #define _IOFBF 0 2025-05-07T20:26:58.4177110Z #define __USE_BSD 1 2025-05-07T20:26:58.4177332Z #define __FLT_EVAL_METHOD__ 0 2025-05-07T20:26:58.4177596Z #define SHRT_MIN (-SHRT_MAX - 1) 2025-05-07T20:26:58.4177860Z #define _IO_USER_LOCK 0x8000 2025-05-07T20:26:58.4178108Z #define _IO_NO_WRITES 8 2025-05-07T20:26:58.4178364Z #define _GLIBCXX_PSEUDO_VISIBILITY(V) 2025-05-07T20:26:58.4178705Z #define __ASMNAME2(prefix,cname) __STRING (prefix) cname 2025-05-07T20:26:58.4179050Z #define _GLIBCXX_HAVE_SYS_STAT_H 1 2025-05-07T20:26:58.4179350Z #define MB_CUR_MAX (__ctype_get_mb_cur_max ()) 2025-05-07T20:26:58.4179671Z #define __cpp_binary_literals 201304L 2025-05-07T20:26:58.4179949Z #define _CPP_TYPE_TRAITS_H 1 2025-05-07T20:26:58.4180206Z #define __BEGIN_NAMESPACE_C99 2025-05-07T20:26:58.4180470Z #define __FLT64_DECIMAL_DIG__ 17 2025-05-07T20:26:58.4180771Z #define _GLIBCXX_SYNCHRONIZATION_HAPPENS_AFTER(A) 2025-05-07T20:26:58.4181150Z #define _G_HAVE_ST_BLKSIZE defined (_STATBUF_ST_BLKSIZE) 2025-05-07T20:26:58.4181610Z #define __cpp_noexcept_function_type 201510L 2025-05-07T20:26:58.4181992Z #define M_PI 3.14159265358979323846 2025-05-07T20:26:58.4182294Z #define _GLIBCXX_PACKAGE_NAME "package-unused" 2025-05-07T20:26:58.4182615Z #define _GLIBCXX_HAVE_BUILTIN_IS_SAME 1 2025-05-07T20:26:58.4182910Z #define __GCC_ATOMIC_CHAR32_T_LOCK_FREE 2 2025-05-07T20:26:58.4183207Z #define 
_POSIX_DELAYTIMER_MAX 32 2025-05-07T20:26:58.4183499Z #define _GLIBCXX_USE_UTIME 1 2025-05-07T20:26:58.4183762Z #define _STL_ITERATOR_BASE_FUNCS_H 1 2025-05-07T20:26:58.4184327Z #define _IO_peekc_unlocked(_fp) (_IO_BE ((_fp)->_IO_read_ptr >= (_fp)->_IO_read_end, 0) && __underflow (_fp) == EOF ? EOF : *(unsigned char *) (_fp)->_IO_read_ptr) 2025-05-07T20:26:58.4184897Z #define _GLIBCXX_TR1_ELL_INTEGRAL_TCC 1 2025-05-07T20:26:58.4185215Z #define w_termsig __wait_terminated.__w_termsig 2025-05-07T20:26:58.4185527Z #define __FLOAT_WORD_ORDER __BYTE_ORDER 2025-05-07T20:26:58.4185822Z #define __cudaCDP2GetErrorName 2025-05-07T20:26:58.4186099Z #define XATTR_SIZE_MAX 65536 2025-05-07T20:26:58.4186361Z #define be64toh(x) __bswap_64 (x) 2025-05-07T20:26:58.4186656Z #define __ASSERT_VOID_CAST static_cast 2025-05-07T20:26:58.4186979Z #define __cpp_variadic_templates 200704L 2025-05-07T20:26:58.4187272Z #define RAND_MAX 2147483647 2025-05-07T20:26:58.4187525Z #define _GLIBCXX_USE_C99_COMPLEX_TR1 1 2025-05-07T20:26:58.4250688Z #define __UINT_FAST64_MAX__ 0xffffffffffffffffUL 2025-05-07T20:26:58.4251136Z #define __SM_90_RT_H__ 2025-05-07T20:26:58.4251387Z #define __SIG_ATOMIC_TYPE__ int 2025-05-07T20:26:58.4251641Z #define __COMPAR_FN_T 2025-05-07T20:26:58.4251888Z #define __GID_T_TYPE __U32_TYPE 2025-05-07T20:26:58.4252159Z #define _IO_BAD_SEEN 0x4000 2025-05-07T20:26:58.4252623Z #define _PSTL_PRAGMA_MESSAGE_IMPL(x) _PSTL_PRAGMA(message(_PSTL_STRING_CONCAT(_PSTL_PRAGMA_LOCATION, x))) 2025-05-07T20:26:58.4253137Z #define __DBL_MIN_10_EXP__ (-307) 2025-05-07T20:26:58.4253479Z #define __glibcxx_requires_sorted_pred(_First,_Last,_Pred) 2025-05-07T20:26:58.4254009Z #define __FINITE_MATH_ONLY__ 0 2025-05-07T20:26:58.4254314Z #define _PSTL_PRAGMA_SIMD_INCLUSIVE_SCAN(PRM) 2025-05-07T20:26:58.4254655Z #define cudaArrayColorAttachment 0x20 2025-05-07T20:26:58.4254970Z #define __cpp_variable_templates 201304L 2025-05-07T20:26:58.4255472Z #define cudaKernelNodeAttributeMemSyncDomainMap cudaLaunchAttributeMemSyncDomainMap 2025-05-07T20:26:58.4256006Z #define __cpp_lib_integral_constant_callable 201304 2025-05-07T20:26:58.4256339Z #define _GLIBCXX_HAVE_SINHF 1 2025-05-07T20:26:58.4256603Z #define MOD_TIMECONST ADJ_TIMECONST 2025-05-07T20:26:58.4256899Z #define __cpp_lib_result_of_sfinae 201210 2025-05-07T20:26:58.4257201Z #define __SM_30_INTRINSICS_H__ 2025-05-07T20:26:58.4257460Z #define __FLT32X_MAX_EXP__ 1024 2025-05-07T20:26:58.4257728Z #define _GLIBCXX_USE_WCHAR_T 1 2025-05-07T20:26:58.4257988Z #define _GLIBCXX_MATH_H 1 2025-05-07T20:26:58.4258230Z #define __u_char_defined 2025-05-07T20:26:58.4258545Z #define WIFEXITED(status) __WIFEXITED (__WAIT_INT (status)) 2025-05-07T20:26:58.4258904Z #define STA_PPSERROR 0x0800 2025-05-07T20:26:58.4259167Z #define _GLIBCXX_STD_A std 2025-05-07T20:26:58.4259413Z #define __FLT32_HAS_DENORM__ 1 2025-05-07T20:26:58.4259694Z #define _GLIBCXX_BEGIN_NAMESPACE_VERSION 2025-05-07T20:26:58.4260128Z #define __device_builtin_texture_type__ __location__(device_builtin_texture_type) 2025-05-07T20:26:58.4260555Z #define FP_INFINITE 1 2025-05-07T20:26:58.4260952Z #define _GLIBCXX11_DEPRECATED_SUGGEST(ALT) _GLIBCXX_DEPRECATED_SUGGEST(ALT) 2025-05-07T20:26:58.4261364Z #define _IO_pid_t __pid_t 2025-05-07T20:26:58.4261607Z #define __UINT_FAST8_MAX__ 0xff 2025-05-07T20:26:58.4261871Z #define __LEAF , __leaf__ 2025-05-07T20:26:58.4262112Z #define PATH_MAX 4096 2025-05-07T20:26:58.4262354Z #define __cpp_rvalue_reference 200610L 2025-05-07T20:26:58.4262685Z #define 
__LDBL_REDIR1(name,proto,alias) name proto 2025-05-07T20:26:58.4263007Z #define _LIMITS_H___ 2025-05-07T20:26:58.4263223Z #define __size_t 2025-05-07T20:26:58.4263450Z #define _GLIBCXX_HAVE_FREXPF 1 2025-05-07T20:26:58.4264481Z #define STA_RONLY (STA_PPSSIGNAL | STA_PPSJITTER | STA_PPSWANDER | STA_PPSERROR | STA_CLOCKERR | STA_NANO | STA_MODE | STA_CLK) 2025-05-07T20:26:58.4265040Z #define _GLIBCXX_HAVE_FREXPL 1 2025-05-07T20:26:58.4265339Z #define __cpp_nested_namespace_definitions 201411L 2025-05-07T20:26:58.4265667Z #define __DEC64_MAX_EXP__ 385 2025-05-07T20:26:58.4265926Z #define _WCHAR_T_DEFINED 2025-05-07T20:26:58.4266270Z #define __glibcxx_requires_can_decrement_range(_First1,_Last1,_First2) 2025-05-07T20:26:58.4266668Z #define MOD_STATUS ADJ_STATUS 2025-05-07T20:26:58.4266964Z #define _GLIBCXX_PURE __attribute__ ((__pure__)) 2025-05-07T20:26:58.4267283Z #define _GLIBCXX_HAVE_STDINT_H 1 2025-05-07T20:26:58.4267565Z #define __SIZEOF_PTHREAD_CONDATTR_T 4 2025-05-07T20:26:58.4267845Z #define __INT8_C(c) c 2025-05-07T20:26:58.4268103Z #define __cudaCDP2GetParameterBuffer 2025-05-07T20:26:58.4268395Z #define _GLIBCXX_HAVE_COSHF 1 2025-05-07T20:26:58.4268660Z #define _GLIBCXX_HAVE_COSHL 1 2025-05-07T20:26:58.4268931Z #define __SM_70_RT_HPP__ 2025-05-07T20:26:58.4269173Z #define __INT_LEAST8_WIDTH__ 8 2025-05-07T20:26:58.4269456Z #define __cpp_variadic_using 201611L 2025-05-07T20:26:58.4269764Z #define __UINT_LEAST64_MAX__ 0xffffffffffffffffUL 2025-05-07T20:26:58.4270086Z #define __INT_LEAST8_MAX__ 0x7f 2025-05-07T20:26:58.4270352Z #define __SM_61_INTRINSICS_HPP__ 2025-05-07T20:26:58.4270616Z #define _IO_FLAGS2_MMAP 1 2025-05-07T20:26:58.4270871Z #define __cpp_capture_star_this 201603L 2025-05-07T20:26:58.4271183Z #define __cudaCDP2LaunchDeviceV2_ptsz 2025-05-07T20:26:58.4271481Z #define _GLIBCXX_HAVE_ENDIAN_H 1 2025-05-07T20:26:58.4271833Z #define __always_inline __inline __attribute__ ((__always_inline__)) 2025-05-07T20:26:58.4272205Z #define NFDBITS __NFDBITS 2025-05-07T20:26:58.4272460Z #define _PSTL_PRAGMA_FORCEINLINE 2025-05-07T20:26:58.4272743Z #define _GLIBCXX_HAVE_SYS_STATVFS_H 1 2025-05-07T20:26:58.4273060Z #define __glibcxx_requires_sorted(_First,_Last) 2025-05-07T20:26:58.4273381Z #define __SHRT_MAX__ 0x7fff 2025-05-07T20:26:58.4273635Z #define _GLIBCXX_SYMVER_GNU 1 2025-05-07T20:26:58.4273917Z #define w_stopval __wait_stopped.__w_stopval 2025-05-07T20:26:58.4274219Z #define STA_UNSYNC 0x0040 2025-05-07T20:26:58.4274519Z #define __LDBL_MAX__ 1.18973149535723176502126385303097021e+4932L 2025-05-07T20:26:58.4274926Z #define _GLIBCXX_USE_C99_COMPLEX _GLIBCXX11_USE_C99_COMPLEX 2025-05-07T20:26:58.4275281Z #define __FLT64X_MAX_10_EXP__ 4932 2025-05-07T20:26:58.4275564Z #define __cpp_if_constexpr 201606L 2025-05-07T20:26:58.4275867Z #define __glibcxx_class_requires4(_a,_b,_c,_d,_e) 2025-05-07T20:26:58.4276229Z #define cudaStreamFireAndForget ((cudaStream_t)0x4) 2025-05-07T20:26:58.4276564Z #define _GLIBCXX_HAVE_WCHAR_H 1 2025-05-07T20:26:58.4276867Z #define _GLIBCXX_USE_C99_STDIO _GLIBCXX11_USE_C99_STDIO 2025-05-07T20:26:58.4277195Z #define __daddr_t_defined 2025-05-07T20:26:58.4277442Z #define __LDBL_IS_IEC_60559__ 2 2025-05-07T20:26:58.4277705Z #define _GLIBCXX_TR1_RIEMANN_ZETA_TCC 1 2025-05-07T20:26:58.4278027Z #define _GLIBCXX_HAVE_STRUCT_DIRENT_D_TYPE 1 2025-05-07T20:26:58.4278529Z #define _PSTL_CPP11_STD_ROTATE_BROKEN ((__GLIBCXX__ && __GLIBCXX__ < 20150716) || (_MSC_VER && _MSC_VER < 1800)) 2025-05-07T20:26:58.4279013Z #define _ACRTIMP 2025-05-07T20:26:58.4279229Z #define 
_IO_EOF_SEEN 0x10 2025-05-07T20:26:58.4279490Z #define _GLIBCXX_TR1_POLY_LAGUERRE_TCC 1 2025-05-07T20:26:58.4279774Z #define _IOS_BIN 128 2025-05-07T20:26:58.4280111Z #define __fortify_function __extern_always_inline __attribute_artificial__ 2025-05-07T20:26:58.4280518Z #define __FLT64X_HAS_QUIET_NAN__ 1 2025-05-07T20:26:58.4280782Z #define UNDERFLOW 4 2025-05-07T20:26:58.4280991Z #define NAME_MAX 255 2025-05-07T20:26:58.4281225Z #define SCHAR_MAX __SCHAR_MAX__ 2025-05-07T20:26:58.4281490Z #define __UINT_LEAST8_MAX__ 0xff 2025-05-07T20:26:58.4281759Z #define __GCC_ATOMIC_BOOL_LOCK_FREE 2 2025-05-07T20:26:58.4282050Z #define _IO_UNIFIED_JUMPTABLES 1 2025-05-07T20:26:58.4282526Z #define __FLT128_DENORM_MIN__ 6.47517511943802511092443895822764655e-4966F128 2025-05-07T20:26:58.4282976Z #define __ptr_t void * 2025-05-07T20:26:58.4283210Z #define M_E 2.7182818284590452354 2025-05-07T20:26:58.4283483Z #define cudaSurfaceType1D 0x01 2025-05-07T20:26:58.4283744Z #define __USE_ISOCXX11 1 2025-05-07T20:26:58.4284000Z #define __UINTMAX_TYPE__ long unsigned int 2025-05-07T20:26:58.4284312Z #define cudaDeviceBlockingSync 0x04 2025-05-07T20:26:58.4284601Z #define CLOCK_MONOTONIC_COARSE 6 2025-05-07T20:26:58.4284866Z #define _GLIBCXX_OS_DEFINES 1 2025-05-07T20:26:58.4285146Z #define _GLIBCXX_NODISCARD [[__nodiscard__]] 2025-05-07T20:26:58.4285456Z #define cudaSurfaceType2D 0x02 2025-05-07T20:26:58.4285708Z #define __linux 1 2025-05-07T20:26:58.4285938Z #define __DEC32_EPSILON__ 1E-6DF 2025-05-07T20:26:58.4286210Z #define cudaDeviceMask 0xff 2025-05-07T20:26:58.4286468Z #define _GLIBCXX_END_NAMESPACE_ALGO 2025-05-07T20:26:58.4286758Z #define __CUDA_API_VER_MAJOR__ 12 2025-05-07T20:26:58.4287032Z #define htobe16(x) __bswap_16 (x) 2025-05-07T20:26:58.4287318Z #define HUGE_VALF (__builtin_huge_valf()) 2025-05-07T20:26:58.4287626Z #define __FLT_EVAL_METHOD_TS_18661_3__ 0 2025-05-07T20:26:58.4287926Z #define HUGE_VALL (__builtin_huge_vall()) 2025-05-07T20:26:58.4288218Z #define _BITS_TYPES_H 1 2025-05-07T20:26:58.4288494Z #define ULONG_LONG_MAX (LONG_LONG_MAX * 2ULL + 1ULL) 2025-05-07T20:26:58.4288829Z #define _IO_cleanup_region_end(_Doit) 2025-05-07T20:26:58.4289126Z #define cudaSurfaceType3D 0x03 2025-05-07T20:26:58.4289395Z #define _GLIBCXX_HAVE_SYS_TIME_H 1 2025-05-07T20:26:58.4289678Z #define __cudaGet_blockIdx() blockIdx 2025-05-07T20:26:58.4289963Z #define _IO_DONT_CLOSE 0100000 2025-05-07T20:26:58.4290766Z #define __MATHDECLX(type,function,suffix,args,attrib) __MATHDECL_1(type, function,suffix, args) __attribute__ (attrib); __MATHDECL_1(type, __CONCAT(__,function),suffix, args) __attribute__ (attrib) 2025-05-07T20:26:58.4291570Z #define cudaHostRegisterDefault 0x00 2025-05-07T20:26:58.4291849Z #define __unix 1 2025-05-07T20:26:58.4292070Z #define MATH_ERRNO 1 2025-05-07T20:26:58.4292306Z #define _GLIBCXX_STDIO_SEEK_END 2 2025-05-07T20:26:58.4292580Z #define _GLIBCXX_USE_FCHMODAT 1 2025-05-07T20:26:58.4292846Z #define __UINT32_MAX__ 0xffffffffU 2025-05-07T20:26:58.4293122Z #define __GXX_EXPERIMENTAL_CXX0X__ 1 2025-05-07T20:26:58.4293404Z #define __UID_T_TYPE __U32_TYPE 2025-05-07T20:26:58.4293834Z #define _GLIBCXX_HAVE_ATOMIC_LOCK_POLICY 1 2025-05-07T20:26:58.4294301Z #define __CUDART_API_VERSION ((__CUDA_API_VER_MAJOR__ * 1000) + (__CUDA_API_VER_MINOR__ * 10)) 2025-05-07T20:26:58.4294765Z #define __nv_pure__ __location__(nv_pure) 2025-05-07T20:26:58.4295059Z #define CUDARTAPI_CDECL 2025-05-07T20:26:58.4295304Z #define _PSTL_USAGE_WARNINGS 0 2025-05-07T20:26:58.4295574Z #define _GLIBCXX98_USE_C99_COMPLEX 
1 2025-05-07T20:26:58.4295856Z #define __cpp_lib_void_t 201411 2025-05-07T20:26:58.4296109Z #define _POSIX_AIO_MAX 1 2025-05-07T20:26:58.4296346Z #define __SIZE_T 2025-05-07T20:26:58.4296595Z #define isgraph_l(c,l) __isgraph_l ((c), (l)) 2025-05-07T20:26:58.4296922Z #define _GLIBCXX_FULLY_DYNAMIC_STRING 0 2025-05-07T20:26:58.4297210Z #define _POSIX_PIPE_BUF 512 2025-05-07T20:26:58.4297468Z #define _GLIBCXX_HAVE_STRTOLD 1 2025-05-07T20:26:58.4297726Z #define _ATFILE_SOURCE 1 2025-05-07T20:26:58.4298101Z #define __glibcxx_assert(cond) do { __glibcxx_constexpr_assert(cond); } while (false) 2025-05-07T20:26:58.4299689Z #define __WAIT_STATUS void * 2025-05-07T20:26:58.4299953Z #define __MATH_FUNCTIONS_H__ 2025-05-07T20:26:58.4300210Z #define _GLIBCXX_HAVE_WCSTOF 1 2025-05-07T20:26:58.4300476Z #define __FLT128_MIN_EXP__ (-16381) 2025-05-07T20:26:58.4300792Z #define _GLIBCXX_HAVE_LC_MESSAGES 1 2025-05-07T20:26:58.4301074Z #define __WINT_MIN__ 0U 2025-05-07T20:26:58.4301641Z #define _PSTL_CPP14_VARIABLE_TEMPLATES_PRESENT (!__INTEL_COMPILER || __INTEL_COMPILER >= 1700) && (_MSC_FULL_VER >= 190023918 || __cplusplus >= 201402L) 2025-05-07T20:26:58.4302274Z #define isdigit_l(c,l) __isdigit_l ((c), (l)) 2025-05-07T20:26:58.4302573Z #define WUNTRACED 2 2025-05-07T20:26:58.4303189Z #define _GLIBCXX_HAVE_SQRTF 1 2025-05-07T20:26:58.4303465Z #define __SIZEOF_PTHREAD_RWLOCKATTR_T 8 2025-05-07T20:26:58.4303745Z #define NZERO 20 2025-05-07T20:26:58.4303967Z #define _GLIBCXX_HAVE_MEMALIGN 1 2025-05-07T20:26:58.4304246Z #define _PSTL_PRAGMA(x) _Pragma(#x) 2025-05-07T20:26:58.4304533Z #define MOD_CLKA ADJ_OFFSET_SINGLESHOT 2025-05-07T20:26:58.4304808Z #define MOD_CLKB ADJ_TICK 2025-05-07T20:26:58.4305066Z #define __FLT128_MIN_10_EXP__ (-4931) 2025-05-07T20:26:58.4305347Z #define __FLT32X_IS_IEC_60559__ 2 2025-05-07T20:26:58.4305610Z #define __DEVICE_FUNCTIONS_H__ 2025-05-07T20:26:58.4305885Z #define SCHAR_MIN (-SCHAR_MAX - 1) 2025-05-07T20:26:58.4306156Z #define EXIT_FAILURE 1 2025-05-07T20:26:58.4306393Z #define ADJ_MAXERROR 0x0004 2025-05-07T20:26:58.4306648Z #define __INT_LEAST16_WIDTH__ 16 2025-05-07T20:26:58.4306912Z #define _SIZE_T_DEFINED_ 2025-05-07T20:26:58.4307160Z #define _POSIX_AIO_LISTIO_MAX 2 2025-05-07T20:26:58.4307429Z #define __cudaCDP2DeviceGetLimit 2025-05-07T20:26:58.4307771Z #define __LDBL_REDIR_NTH(name,proto) name proto __THROW 2025-05-07T20:26:58.4308124Z #define __cudaCDP2FuncGetAttributes 2025-05-07T20:26:58.4308403Z #define __SCHAR_MAX__ 0x7f 2025-05-07T20:26:58.4308652Z #define __FLT128_MANT_DIG__ 113 2025-05-07T20:26:58.4308920Z #define __USING_NAMESPACE_STD(name) 2025-05-07T20:26:58.4309202Z #define _GLIBCXX_HAVE_OBSOLETE_ISINF 1 2025-05-07T20:26:58.4309501Z #define __WCHAR_MIN__ (-__WCHAR_MAX__ - 1) 2025-05-07T20:26:58.4309782Z #define SEEK_DATA 3 2025-05-07T20:26:58.4310005Z #define __KERNEL_STRICT_NAMES 2025-05-07T20:26:58.4310298Z #define _IO_stderr ((_IO_FILE*)(&_IO_2_1_stderr_)) 2025-05-07T20:26:58.4310707Z #define _IO_ferror_unlocked(__fp) (((__fp)->_flags & _IO_ERR_SEEN) != 0) 2025-05-07T20:26:58.4311090Z #define _FUNCTEXCEPT_H 1 2025-05-07T20:26:58.4311329Z #define __INT64_C(c) c ## L 2025-05-07T20:26:58.4311597Z #define __NTH(fct) __LEAF_ATTR fct throw () 2025-05-07T20:26:58.4311942Z #define _GLIBCXX_CONST __attribute__ ((__const__)) 2025-05-07T20:26:58.4312274Z #define _GLIBCXX_HAVE_LINK 1 2025-05-07T20:26:58.4312538Z #define cudaNvSciSyncAttrWait 0x2 2025-05-07T20:26:58.4312833Z #define __GCC_ATOMIC_POINTER_LOCK_FREE 2 2025-05-07T20:26:58.4313131Z #define STA_PPSWANDER 
0x0400 2025-05-07T20:26:58.4313378Z #define __INT_WCHAR_T_H 2025-05-07T20:26:58.4313617Z #define WSTOPPED 2 2025-05-07T20:26:58.4313855Z #define _POSIX_THREAD_THREADS_MAX 64 2025-05-07T20:26:58.4314131Z #define _POSIX_MQ_OPEN_MAX 8 2025-05-07T20:26:58.4314379Z #define FP_NORMAL 4 2025-05-07T20:26:58.4314618Z #define __cudaCDP2LaunchDevice_ptsz 2025-05-07T20:26:58.4314884Z #define _BITS_TIMEX_H 1 2025-05-07T20:26:58.4315113Z #define _POSIX_LINK_MAX 8 2025-05-07T20:26:58.4315363Z #define _GLIBCXX_HAVE_LIMIT_FSIZE 1 2025-05-07T20:26:58.4315640Z #define _GLIBCXX_HAVE_ATAN2F 1 2025-05-07T20:26:58.4315902Z #define cudaTextureType1D 0x01 2025-05-07T20:26:58.4316164Z #define _GLIBCXX_HAVE_ATAN2L 1 2025-05-07T20:26:58.4316417Z #define COLL_WEIGHTS_MAX 255 2025-05-07T20:26:58.4316681Z #define __isascii(c) (((c) & ~0x7f) == 0) 2025-05-07T20:26:58.4316974Z #define __toascii(c) ((c) & 0x7f) 2025-05-07T20:26:58.4317387Z #define __attribute_format_strfmon__(a,b) __attribute__ ((__format__ (__strfmon__, a, b))) 2025-05-07T20:26:58.4317820Z #define _IO_MAGIC 0xFBAD0000 2025-05-07T20:26:58.4318079Z #define _GLIBCXX_USE_SENDFILE 1 2025-05-07T20:26:58.4318332Z #define _POSIX_SOURCE 1 2025-05-07T20:26:58.4318570Z #define cudaTextureType2D 0x02 2025-05-07T20:26:58.4318824Z #define _PTR_TRAITS_H 1 2025-05-07T20:26:58.4319085Z #define _GLIBCXX_NOEXCEPT_QUAL noexcept (_NE) 2025-05-07T20:26:58.4319382Z #define _GLIBCXX_HAVE_POWF 1 2025-05-07T20:26:58.4319641Z #define _POSIX2_BC_STRING_MAX 1000 2025-05-07T20:26:58.4319954Z #define __attribute_used__ __attribute__ ((__used__)) 2025-05-07T20:26:58.4320279Z #define cudaTextureType3D 0x03 2025-05-07T20:26:58.4320536Z #define _STDIO_USES_IOSTREAM 2025-05-07T20:26:58.4320788Z #define CLOCK_REALTIME 0 2025-05-07T20:26:58.4321159Z #define __FLT32X_MANT_DIG__ 53 2025-05-07T20:26:58.4321535Z #define __GCC_ATOMIC_CHAR16_T_LOCK_FREE 2 2025-05-07T20:26:58.4321836Z #define __cpp_aligned_new 201606L 2025-05-07T20:26:58.4322105Z #define __USER_LABEL_PREFIX__ 2025-05-07T20:26:58.4322369Z #define cudaEventBlockingSync 0x01 2025-05-07T20:26:58.4322647Z #define _GLIBCXX_HAVE_TANL 1 2025-05-07T20:26:58.4322908Z #define _GLIBCXX_USE_PTHREAD_RWLOCK_T 1 2025-05-07T20:26:58.4323201Z #define _GLIBCXX_HAVE_LINUX_RANDOM_H 1 2025-05-07T20:26:58.4323487Z #define _GLIBCXX_USE_C99_FENV_TR1 1 2025-05-07T20:26:58.4323761Z #define __FLT32_MAX_10_EXP__ 38 2025-05-07T20:26:58.4324006Z #define __GLIBC__ 2 2025-05-07T20:26:58.4324210Z #define __END_DECLS } 2025-05-07T20:26:58.4324445Z #define FP_ILOGB0 (-2147483647 - 1) 2025-05-07T20:26:58.4324795Z #define __FLT64X_EPSILON__ 1.08420217248550443400745280086994171e-19F64x 2025-05-07T20:26:58.4325156Z #define __CONCAT(x,y) x ## y 2025-05-07T20:26:58.4325398Z #define WCONTINUED 8 2025-05-07T20:26:58.4325625Z #define __STDC_HOSTED__ 1 2025-05-07T20:26:58.4325884Z #define _GLIBCXX_HAVE_ARPA_INET_H 1 2025-05-07T20:26:58.4326149Z #define _ALLOCA_H 1 2025-05-07T20:26:58.4326373Z #define __host__ __location__(host) 2025-05-07T20:26:58.4326773Z #define __warndecl(name,msg) extern void name (void) __attribute__((__warning__ (msg))) 2025-05-07T20:26:58.4327196Z #define __SLONG32_TYPE int 2025-05-07T20:26:58.4327451Z #define _GLIBCXX_DEBUG_ASSERTIONS_H 1 2025-05-07T20:26:58.4327723Z #define _SYS_SELECT_H 1 2025-05-07T20:26:58.4327955Z #define _IO_LINE_BUF 0x200 2025-05-07T20:26:58.4328203Z #define _IOS_NOCREATE 32 2025-05-07T20:26:58.4328451Z #define __DEC64_MIN_EXP__ (-382) 2025-05-07T20:26:58.4328719Z #define __cudaGet_warpSize() warpSize 
2025-05-07T20:26:58.4329004Z #define __SSIZE_T_TYPE __SWORD_TYPE 2025-05-07T20:26:58.4329283Z #define _GLIBCXX_HAVE_LIMIT_VMEM 0 2025-05-07T20:26:58.4329554Z #define __global__ __location__(global) 2025-05-07T20:26:58.4329832Z #define __GNU_LIBRARY__ 6 2025-05-07T20:26:58.4330081Z #define __cpp_decltype_auto 201304L 2025-05-07T20:26:58.4330354Z #define __DBL_DIG__ 15 2025-05-07T20:26:58.4330575Z #define TIME_UTC 1 2025-05-07T20:26:58.4330788Z #define __FLT32_DIG__ 6 2025-05-07T20:26:58.4331095Z #define __forceinline__ __inline__ __attribute__((always_inline)) 2025-05-07T20:26:58.4331479Z #define cudaHostAllocWriteCombined 0x04 2025-05-07T20:26:58.4331787Z #define cudaDeviceScheduleAuto 0x00 2025-05-07T20:26:58.4332081Z #define iscntrl_l(c,l) __iscntrl_l ((c), (l)) 2025-05-07T20:26:58.4332369Z #define _G_BUFSIZ 8192 2025-05-07T20:26:58.4332660Z #define __FLT_EPSILON__ 1.19209289550781250000000000000000000e-7F 2025-05-07T20:26:58.4333016Z #define cudaTextureTypeCubemap 0x0C 2025-05-07T20:26:58.4333297Z #define __cudaCDP2GetDevice 2025-05-07T20:26:58.4333568Z #define __cudaCDP2PeekAtLastError 2025-05-07T20:26:58.4334015Z #define STA_CLOCKERR 0x1000 2025-05-07T20:26:58.4334276Z #define __GXX_WEAK__ 1 2025-05-07T20:26:58.4334549Z #define __RLIM_T_TYPE __SYSCALL_ULONG_TYPE 2025-05-07T20:26:58.4334883Z #define _GLIBCXX_HAVE_ISNANF 1 2025-05-07T20:26:58.4335167Z #define __SHRT_WIDTH__ 16 2025-05-07T20:26:58.4335488Z #define __cpp_lib_robust_nonmodifying_seq_ops 201304 2025-05-07T20:26:58.4335866Z #define _GLIBCXX_BITS_SPECFUN_H 1 2025-05-07T20:26:58.4336163Z #define _GLIBCXX_HAVE_ISNANL 1 2025-05-07T20:26:58.4336472Z #define isblank_l(c,l) __isblank_l ((c), (l)) 2025-05-07T20:26:58.4336802Z #define _G_config_h 1 2025-05-07T20:26:58.4337093Z #define M_LOG2El 1.442695040888963407359924681001892137L 2025-05-07T20:26:58.4337465Z #define ADJ_OFFSET_SINGLESHOT 0x8001 2025-05-07T20:26:58.4337766Z #define _GCC_WCHAR_T 2025-05-07T20:26:58.4338016Z #define TMP_MAX 238328 2025-05-07T20:26:58.4338264Z #define __FLT32_IS_IEC_60559__ 2 2025-05-07T20:26:58.4338552Z #define __DEVICE_TYPES_H__ 2025-05-07T20:26:58.4338829Z #define __DEV_T_TYPE __UQUAD_TYPE 2025-05-07T20:26:58.4339126Z #define _EXT_NUMERIC_TRAITS 1 2025-05-07T20:26:58.4339429Z #define _GLIBCXX_BEGIN_NAMESPACE_ALGO 2025-05-07T20:26:58.4339744Z #define _IO_SKIPWS 01 2025-05-07T20:26:58.4340324Z #define cudaStreamGraphFireAndForgetAsSibling (cudaStream_t)0x0300000000000000 2025-05-07T20:26:58.4340939Z #define _IO_SCIENTIFIC 04000 2025-05-07T20:26:58.4341232Z #define _GLIBCXX_HAVE_STRING_H 1 2025-05-07T20:26:58.4341594Z #define __LDBL_MIN__ 3.36210314311209350626267781732175260e-4932L 2025-05-07T20:26:58.4342003Z #define cudaDeviceScheduleSpin 0x01 2025-05-07T20:26:58.4342411Z #define __nonnull(params) __attribute__ ((__nonnull__ params)) 2025-05-07T20:26:58.4342818Z #define __DBL_IS_IEC_60559__ 2 2025-05-07T20:26:58.4343088Z #define le32toh(x) (x) 2025-05-07T20:26:58.4343337Z #define _SIZE_T_DEFINED 2025-05-07T20:26:58.4343610Z #define _GLIBCXX_HAVE_XLOCALE_H 1 2025-05-07T20:26:58.4343983Z #define cudaArraySparsePropertiesSingleMipTail 0x1 2025-05-07T20:26:58.4344382Z #define __DEC32_MAX__ 9.999999E96DF 2025-05-07T20:26:58.4344833Z #define __WIFSIGNALED(status) (((signed char) (((status) & 0x7f) + 1) >> 1) > 0) 2025-05-07T20:26:58.4345302Z #define _GLIBCXX_HAVE_FMODL 1 2025-05-07T20:26:58.4345600Z #define _GLIBCXX_HAVE_POLL 1 2025-05-07T20:26:58.4345890Z #define __SM_32_INTRINSICS_H__ 2025-05-07T20:26:58.4346171Z #define _POSIX_NAME_MAX 14 
2025-05-07T20:26:58.4346471Z [predefined-macro dump truncated: a long series of #define lines emitted by the preprocessor, covering glibc, libstdc++, and CUDA runtime headers. Recoverable toolchain details: GCC 11.4.0 (__GNUC__ 11, __VERSION__ "11.4.0"), libstdc++ __GLIBCXX__ 20230528, C++17 mode (__cplusplus 201703L), nvcc 12.6.85 (__CUDACC_VER_MAJOR__ 12, __CUDACC_VER_MINOR__ 6, __CUDACC_VER_BUILD__ 85), __CUDA_ARCH__ 520, _POSIX_C_SOURCE 200809L, target x86_64 Linux LP64 (__x86_64__ 1, __linux__ 1, __LP64__ 1).]
201511L 2025-05-07T20:26:58.4596002Z #define __WIFCONTINUED(status) ((status) == __W_CONTINUED) 2025-05-07T20:26:58.4596100Z #define CUDA_DOUBLE_MATH_FUNCTIONS 1 2025-05-07T20:26:58.4596193Z #define _GLIBCXX_HAVE_FENV_H 1 2025-05-07T20:26:58.4596291Z #define _GLIBCXX_HAVE_STDBOOL_H 1 2025-05-07T20:26:58.4596383Z #define __SIZEOF_FLOAT128__ 16 2025-05-07T20:26:58.4596508Z #define __INT_LEAST64_MAX__ 0x7fffffffffffffffL 2025-05-07T20:26:58.4596615Z #define _GLIBCXX_TR1_HYPERGEOMETRIC_TCC 1 2025-05-07T20:26:58.4596732Z #define _GLIBCXX_DEBUG_PEDASSERT(_Condition) 2025-05-07T20:26:58.4596826Z #define __clock_t_defined 1 2025-05-07T20:26:58.4596921Z #define _POSIX_SEM_VALUE_MAX 32767 2025-05-07T20:26:58.4597028Z #define __cudaCDP2RuntimeGetVersion 2025-05-07T20:26:58.4597124Z #define __GLIBC_MINOR__ 17 2025-05-07T20:26:58.4597217Z #define __DEC64_MIN__ 1E-383DD 2025-05-07T20:26:58.4597312Z #define __WINT_TYPE__ unsigned int 2025-05-07T20:26:58.4597425Z #define __UINT_LEAST32_TYPE__ unsigned int 2025-05-07T20:26:58.4597513Z #define __SIZEOF_SHORT__ 2 2025-05-07T20:26:58.4597679Z #define __FLT32_NORM_MAX__ 3.40282346638528859811704183484516925e+38F32 2025-05-07T20:26:58.4597762Z #define __SSE__ 1 2025-05-07T20:26:58.4597856Z #define SEM_VALUE_MAX (2147483647) 2025-05-07T20:26:58.4597957Z #define M_SQRT1_2 0.70710678118654752440 2025-05-07T20:26:58.4598038Z #define _CTYPE_H 1 2025-05-07T20:26:58.4598127Z #define __sigset_t_defined 2025-05-07T20:26:58.4598564Z #define __LDBL_MIN_EXP__ (-16381) 2025-05-07T20:26:58.4598707Z #define _GLIBCXX_HAVE_LOGF 1 2025-05-07T20:26:58.4598824Z #define MOD_TAI ADJ_TAI 2025-05-07T20:26:58.4598940Z #define _IO_va_list __gnuc_va_list 2025-05-07T20:26:58.4599033Z #define _GLIBCXX_HAVE_LOGL 1 2025-05-07T20:26:58.4599113Z #define __SM_70_RT_H__ 2025-05-07T20:26:58.4599573Z #define _GLIBCXX_HAVE_WRITEV 1 2025-05-07T20:26:58.4599676Z #define cudaEventWaitDefault 0x00 2025-05-07T20:26:58.4599772Z #define _GLIBCXX_HAVE_EXPL 1 2025-05-07T20:26:58.4599943Z #define __FLT64_MAX__ 1.79769313486231570814527423731704357e+308F64 2025-05-07T20:26:58.4600034Z #define _POSIX_MAX_CANON 255 2025-05-07T20:26:58.4600144Z #define _GLIBCXX_NOEXCEPT_PARM , bool _NE 2025-05-07T20:26:58.4600236Z #define FD_SETSIZE __FD_SETSIZE 2025-05-07T20:26:58.4600322Z #define _GLIBCXX_TXN_SAFE 2025-05-07T20:26:58.4600415Z #define __WINT_WIDTH__ 32 2025-05-07T20:26:58.4600517Z #define __CUDA_DEVICE_RUNTIME_API_H__ 2025-05-07T20:26:58.4600775Z #define __REDIRECT_NTHNL(name,proto,alias) name proto __THROWNL __asm__ (__ASMNAME (#alias)) 2025-05-07T20:26:58.4600875Z #define _GLIBCXX_STDIO_SEEK_CUR 1 2025-05-07T20:26:58.4600957Z #define EOF (-1) 2025-05-07T20:26:58.4601049Z #define __WAIT_STATUS_DEFN void * 2025-05-07T20:26:58.4601147Z #define __USE_POSIX199309 1 2025-05-07T20:26:58.4601244Z #define __INT_LEAST64_WIDTH__ 64 2025-05-07T20:26:58.4601347Z #define __LDBL_MAX_EXP__ 16384 2025-05-07T20:26:58.4601439Z #define __FLT32X_MAX_10_EXP__ 308 2025-05-07T20:26:58.4601534Z #define LLONG_MIN (-LLONG_MAX-1) 2025-05-07T20:26:58.4601648Z #define cudaSurfaceType2DLayered 0xF2 2025-05-07T20:26:58.4601739Z #define ____mbstate_t_defined 1 2025-05-07T20:26:58.4601841Z #define STA_NANO 0x2000 2025-05-07T20:26:58.4601940Z #define _GLIBCXX_HAVE_LOG10F 1 2025-05-07T20:26:58.4602031Z #define _GLIBCXX_HAVE_LOG10L 1 2025-05-07T20:26:58.4602114Z #define _IO_LINKED 0x80 2025-05-07T20:26:58.4602378Z #define __cpp_lib_launder 201606 2025-05-07T20:26:58.4602502Z #define __SIZEOF_INT128__ 16 2025-05-07T20:26:58.4608372Z #define 
__PTHREAD_MUTEX_HAVE_PREV 1 2025-05-07T20:26:58.4608493Z #define __FLT64X_IS_IEC_60559__ 2 2025-05-07T20:26:58.4608604Z #define _GLIBCXX_TYPE_TRAITS 1 2025-05-07T20:26:58.4608750Z #define cudaGraphKernelNodePortProgrammatic 1 2025-05-07T20:26:58.4608872Z #define __DEVICE_ATOMIC_FUNCTIONS_HPP__ 2025-05-07T20:26:58.4608992Z #define __BLKCNT64_T_TYPE __SQUAD_TYPE 2025-05-07T20:26:58.4609090Z #define __LDBL_MAX_10_EXP__ 4932 2025-05-07T20:26:58.4609192Z #define __W_CONTINUED 0xffff 2025-05-07T20:26:58.4609284Z #define __ATOMIC_RELAXED 0 2025-05-07T20:26:58.4609415Z #define w_coredump __wait_terminated.__w_coredump 2025-05-07T20:26:58.4609547Z #define __FSBLKCNT_T_TYPE __SYSCALL_ULONG_TYPE 2025-05-07T20:26:58.4609749Z #define __cudaCDP2OccupancyMaxActiveBlocksPerMultiprocessor 2025-05-07T20:26:58.4609933Z #define __DBL_EPSILON__ double(2.22044604925031308084726333618164062e-16L) 2025-05-07T20:26:58.4610028Z #define __stub_stty 2025-05-07T20:26:58.4610194Z #define _tolower(c) ((int) (*__ctype_tolower_loc ())[(int) (c)]) 2025-05-07T20:26:58.4610289Z #define le16toh(x) (x) 2025-05-07T20:26:58.4610398Z #define BC_SCALE_MAX _POSIX2_BC_SCALE_MAX 2025-05-07T20:26:58.4610570Z #define __FLT128_MIN__ 3.36210314311209350626267781732175260e-4932F128 2025-05-07T20:26:58.4610663Z #define _SIZET_ 2025-05-07T20:26:58.4610761Z #define XATTR_NAME_MAX 255 2025-05-07T20:26:58.4610854Z #define _SVID_SOURCE 1 2025-05-07T20:26:58.4610944Z #define _LP64 1 2025-05-07T20:26:58.4611034Z #define _LIBC_LIMITS_H_ 1 2025-05-07T20:26:58.4611265Z #define __REDIRECT_NTH_LDBL(name,proto,alias) __REDIRECT_NTH (name, proto, alias) 2025-05-07T20:26:58.4611382Z #define _GLIBCXX_TR1_BESSEL_FUNCTION_TCC 1 2025-05-07T20:26:58.4611469Z #define __UINT8_C(c) c 2025-05-07T20:26:58.4611565Z #define _GLIBCXX_HAVE_CEILF 1 2025-05-07T20:26:58.4611667Z #define _GLIBCXX_HAVE_CEILL 1 2025-05-07T20:26:58.4611779Z #define __cudaCDP2Memset3DAsync_ptsz 2025-05-07T20:26:58.4611881Z #define __CUDA_ARCH_LIST__ 520 2025-05-07T20:26:58.4611975Z #define __FLT64_MAX_EXP__ 1024 2025-05-07T20:26:58.4612075Z #define MOD_MAXERROR ADJ_MAXERROR 2025-05-07T20:26:58.4612173Z #define CUDARTAPI 2025-05-07T20:26:58.4612257Z #define IOV_MAX 1024 2025-05-07T20:26:58.4612403Z #define __glibcxx_requires_irreflexive2(_First,_Last) 2025-05-07T20:26:58.4612637Z #define __INT_LEAST32_TYPE__ int 2025-05-07T20:26:58.4612828Z #define cudaMemAttachSingle 0x04 2025-05-07T20:26:58.4612912Z #define __wchar_t__ 2025-05-07T20:26:58.4613022Z #define __cpp_lib_is_aggregate 201703 2025-05-07T20:26:58.4613106Z #define SEEK_END 2 2025-05-07T20:26:58.4613199Z #define __SIZEOF_WCHAR_T__ 4 2025-05-07T20:26:58.4613376Z #define _GLIBCXX_USE_TBB_PAR_BACKEND __has_include() 2025-05-07T20:26:58.4613474Z #define _IO_ftrylockfile(_fp) 2025-05-07T20:26:58.4613750Z #define _GLIBCXX_USE_C99_WCHAR _GLIBCXX11_USE_C99_WCHAR 2025-05-07T20:26:58.4613842Z #define ____FILE_defined 1 2025-05-07T20:26:58.4613960Z #define _GLIBCXX_HAVE_BUILTIN_IS_AGGREGATE 1 2025-05-07T20:26:58.4614062Z #define __GNUC_PATCHLEVEL__ 0 2025-05-07T20:26:58.4614150Z #define _ISOC99_SOURCE 1 2025-05-07T20:26:58.4614245Z #define __VECTOR_FUNCTIONS_H__ 2025-05-07T20:26:58.4614497Z #define __REDIRECT_NTH(name,proto,alias) name proto __THROW __asm__ (__ASMNAME (#alias)) 2025-05-07T20:26:58.4614631Z #define _PSTL_USE_NONTEMPORAL_STORES_IF_ALLOWED 2025-05-07T20:26:58.4614720Z #define _IO_RIGHT 04 2025-05-07T20:26:58.4614821Z #define __END_NAMESPACE_STD 2025-05-07T20:26:58.4615005Z #define __FLT128_NORM_MAX__ 
1.18973149535723176508575932662800702e+4932F128 2025-05-07T20:26:58.4615104Z #define _GLIBCXX_STD_C std 2025-05-07T20:26:58.4615224Z #define cudaInitDeviceFlagsAreValid 0x01 2025-05-07T20:26:58.4615321Z #define _LARGEFILE64_SOURCE 1 2025-05-07T20:26:58.4615428Z #define _GLIBCXX_USE_C99_STDINT_TR1 1 2025-05-07T20:26:58.4615513Z #define _STDDEF_H_ 2025-05-07T20:26:58.4615596Z #define __amd64__ 1 2025-05-07T20:26:58.4615773Z #define __FLT64_NORM_MAX__ 1.79769313486231570814527423731704357e+308F64 2025-05-07T20:26:58.4615872Z #define __FLT128_HAS_QUIET_NAN__ 1 2025-05-07T20:26:58.4615990Z #define isalnum_l(c,l) __isalnum_l ((c), (l)) 2025-05-07T20:26:58.4616193Z #define __FD_ISSET(d,set) ((__FDS_BITS (set)[__FD_ELT (d)] & __FD_MASK (d)) != 0) 2025-05-07T20:26:58.4616305Z #define __INTMAX_MAX__ 0x7fffffffffffffffL 2025-05-07T20:26:58.4616458Z #define __glibcxx_requires_irreflexive(_First,_Last) 2025-05-07T20:26:58.4616589Z #define cudaGraphKernelNodePortDefault 0 2025-05-07T20:26:58.4616695Z #define __INT_FAST8_TYPE__ signed char 2025-05-07T20:26:58.4616813Z #define __cudaCDP2Memcpy3DAsync_ptsz 2025-05-07T20:26:58.4616914Z #define __PID_T_TYPE __S32_TYPE 2025-05-07T20:26:58.4617027Z #define __cpp_namespace_attributes 201411L 2025-05-07T20:26:58.4617131Z #define CHARCLASS_NAME_MAX 2048 2025-05-07T20:26:58.4617228Z #define _GLIBCXX_HAVE_TANF 1 2025-05-07T20:26:58.4617324Z #define _GLIBCXX_USE_ST_MTIM 1 2025-05-07T20:26:58.4617502Z #define __FLT64X_MIN__ 3.36210314311209350626267781732175260e-4932F64x 2025-05-07T20:26:58.4617596Z #define __CUDA_RUNTIME_H__ 2025-05-07T20:26:58.4617778Z #define WIFSIGNALED(status) __WIFSIGNALED (__WAIT_INT (status)) 2025-05-07T20:26:58.4617877Z #define _GLIBCXX_HAVE_STDLIB_H 1 2025-05-07T20:26:58.4617974Z #define __STDCPP_THREADS__ 1 2025-05-07T20:26:58.4618130Z #define M_2_SQRTPIl 1.128379167095512573896158903121545172L 2025-05-07T20:26:58.4618235Z #define __GNUC_STDC_INLINE__ 1 2025-05-07T20:26:58.4618329Z #define _POSIX_UIO_MAXIOV 16 2025-05-07T20:26:58.4618438Z #define _PSTL_PAR_BACKEND_SERIAL 2025-05-07T20:26:58.4618535Z #define P_tmpdir "/tmp" 2025-05-07T20:26:58.4618654Z #define __ASSERT_FUNCTION __PRETTY_FUNCTION__ 2025-05-07T20:26:58.4618755Z #define __FLT64_HAS_DENORM__ 1 2025-05-07T20:26:58.4618857Z #define __WORDSIZE_TIME64_COMPAT32 1 2025-05-07T20:26:58.4619021Z #define _GLIBCXX_DEPRECATED __attribute__ ((__deprecated__)) 2025-05-07T20:26:58.4619197Z #define __FLT32_EPSILON__ 1.19209289550781250000000000000000000e-7F32 2025-05-07T20:26:58.4619298Z #define _PSTL_HIDE_FROM_ABI_PUSH 2025-05-07T20:26:58.4619426Z #define cudaStreamLegacy ((cudaStream_t)0x1) 2025-05-07T20:26:58.4619539Z #define _IO_cleanup_region_start(_fct,_fp) 2025-05-07T20:26:58.4619641Z #define __location__(a) __annotate__(a) 2025-05-07T20:26:58.4619872Z #define __device_builtin_surface_type__ __location__(device_builtin_surface_type) 2025-05-07T20:26:58.4620138Z #define _POSIX2_BC_BASE_MAX 99 2025-05-07T20:26:58.4620254Z #define __cudaCDP2DeviceGetAttribute 2025-05-07T20:26:58.4620357Z #define __DBL_DECIMAL_DIG__ 17 2025-05-07T20:26:58.4620448Z #define __STDC_UTF_32__ 1 2025-05-07T20:26:58.4620543Z #define __INT_FAST8_WIDTH__ 8 2025-05-07T20:26:58.4620647Z #define NAN (__builtin_nanf ("")) 2025-05-07T20:26:58.4620744Z #define _POSIX_MQ_PRIO_MAX 32 2025-05-07T20:26:58.4620837Z #define __FXSR__ 1 2025-05-07T20:26:58.4620919Z #define _SIZE_T 2025-05-07T20:26:58.4621022Z #define _GLIBCXX_USE_GETTIMEOFDAY 1 2025-05-07T20:26:58.4621134Z #define cudaHostRegisterReadOnly 0x08 
2025-05-07T20:26:58.4621307Z #define __FLT32X_MAX__ 1.79769313486231570814527423731704357e+308F32x 2025-05-07T20:26:58.4621454Z #define __WIFSTOPPED(status) (((status) & 0xff) == 0x7f) 2025-05-07T20:26:58.4621551Z #define _IO_ssize_t __ssize_t 2025-05-07T20:26:58.4621650Z #define __ULONG32_TYPE unsigned int 2025-05-07T20:26:58.4621839Z #define __DBL_NORM_MAX__ double(1.79769313486231570814527423731704357e+308L) 2025-05-07T20:26:58.4622045Z #define cudaStreamGraphTailLaunch (cudaStream_t)0x0100000000000000 2025-05-07T20:26:58.4622133Z #define _GXX_NULLPTR_T 2025-05-07T20:26:58.4622255Z #define __glibcxx_class_requires3(_a,_b,_c,_d) 2025-05-07T20:26:58.4622346Z #define FOPEN_MAX 16 2025-05-07T20:26:58.4622434Z #define __BIG_ENDIAN 4321 2025-05-07T20:26:58.4622551Z #define __BYTE_ORDER__ __ORDER_LITTLE_ENDIAN__ 2025-05-07T20:26:58.4622651Z #define __suseconds_t_defined 2025-05-07T20:26:58.4622738Z #define __off_t_defined 2025-05-07T20:26:58.4622831Z #define stderr stderr 2025-05-07T20:26:58.4622926Z #define M_LOG10E 0.43429448190325182765 2025-05-07T20:26:58.4623039Z #define __glibcxx_requires_string(_String) 2025-05-07T20:26:58.4623142Z #define _GLIBCXX_HAVE_LDEXPL 1 2025-05-07T20:26:58.4623234Z #define __INTMAX_WIDTH__ 64 2025-05-07T20:26:58.4623640Z #define _PSTL_CPP14_2RANGE_MISMATCH_EQUAL_PRESENT (_MSC_VER >= 1900 || __cplusplus >= 201300L || __cpp_lib_robust_nonmodifying_seq_ops == 201304) 2025-05-07T20:26:58.4623741Z #define __mode_t_defined 2025-05-07T20:26:58.4623825Z #define _GCC_SIZE_T 2025-05-07T20:26:58.4623923Z #define __INO64_T_TYPE __UQUAD_TYPE 2025-05-07T20:26:58.4624035Z #define __cpp_runtime_arrays 198712L 2025-05-07T20:26:58.4624140Z #define __UINT64_TYPE__ long unsigned int 2025-05-07T20:26:58.4624235Z #define __USE_XOPEN2K8XSI 1 2025-05-07T20:26:58.4624335Z #define __UINT32_C(c) c ## U 2025-05-07T20:26:58.4624443Z #define __cpp_alias_templates 200704L 2025-05-07T20:26:58.4624555Z #define cudaHostAllocMapped 0x02 2025-05-07T20:26:58.4624662Z #define __DEVICE_LAUNCH_PARAMETERS_H__ 2025-05-07T20:26:58.4624753Z #define _STL_ITERATOR_H 1 2025-05-07T20:26:58.4624841Z #define __size_t__ 2025-05-07T20:26:58.4624970Z #define cudaStreamAttrID cudaLaunchAttributeID 2025-05-07T20:26:58.4625065Z #define _GLIBCXX_HAVE_ATANF 1 2025-05-07T20:26:58.4625181Z #define cudaEventRecordExternal 0x01 2025-05-07T20:26:58.4625334Z #define __isspace_l(c,l) __isctype_l((c), _ISspace, (l)) 2025-05-07T20:26:58.4625433Z #define _IO_BUFSIZ _G_BUFSIZ 2025-05-07T20:26:58.4625605Z #define __FLT_DENORM_MIN__ 1.40129846432481707092372958328991613e-45F 2025-05-07T20:26:58.4625691Z #define _ENDIAN_H 1 2025-05-07T20:26:58.4625802Z #define __builtin_align__(a) __align__(a) 2025-05-07T20:26:58.4625899Z #define _GLIBCXX20_CONSTEXPR 2025-05-07T20:26:58.4626000Z #define __NV_NO_HOST_COMPILER_CHECK 1 2025-05-07T20:26:58.4626087Z #define __try try 2025-05-07T20:26:58.4626186Z #define _GLIBCXX_HAVE_FINITE 1 2025-05-07T20:26:58.4626281Z #define __FLT128_IS_IEC_60559__ 2 2025-05-07T20:26:58.4626379Z #define __INT8_MAX__ 0x7f 2025-05-07T20:26:58.4626631Z #define cudaStreamGetCaptureInfo __CUDART_API_PTSZ(cudaStreamGetCaptureInfo_v2) 2025-05-07T20:26:58.4626720Z #define __LONG_WIDTH__ 64 2025-05-07T20:26:58.4626807Z #define __PIC__ 2 2025-05-07T20:26:58.4626919Z #define BC_STRING_MAX _POSIX2_BC_STRING_MAX 2025-05-07T20:26:58.4627038Z #define __UINT_FAST32_TYPE__ long unsigned int 2025-05-07T20:26:58.4627284Z #define FD_ISSET(fd,fdsetp) __FD_ISSET (fd, fdsetp) 2025-05-07T20:26:58.4627486Z #define _GLIBCXX_HAVE_FLOAT_H 1 
2025-05-07T20:26:58.4627584Z #define _GLIBCXX_HAVE_ATANL 1 2025-05-07T20:26:58.4627764Z #define __FLT32X_NORM_MAX__ 1.79769313486231570814527423731704357e+308F32x 2025-05-07T20:26:58.4627863Z #define __DEVICE_FUNCTIONS_HPP__ 2025-05-07T20:26:58.4627968Z #define __CHAR32_TYPE__ unsigned int 2025-05-07T20:26:58.4628056Z #define _IO_uid_t __uid_t 2025-05-07T20:26:58.4628154Z #define _GLIBCXX_HAVE_READLINK 1 2025-05-07T20:26:58.4628288Z #define __cudaCDP2EventRecordWithFlags_ptsz 2025-05-07T20:26:58.4628379Z #define _CONCEPT_CHECK_H 1 2025-05-07T20:26:58.4628523Z #define __FLT_MAX__ 3.40282346638528859811704183484516925e+38F 2025-05-07T20:26:58.4628630Z #define _GLIBCXX_HAVE_NETINET_IN_H 1 2025-05-07T20:26:58.4628749Z #define _GLIBCXX_TR1_SPECIAL_FUNCTION_UTIL_H 1 2025-05-07T20:26:58.4628837Z #define LONG_BIT 64 2025-05-07T20:26:58.4628943Z #define __SIZEOF_PTHREAD_BARRIERATTR_T 4 2025-05-07T20:26:58.4629053Z #define _GLIBCXX_USE_ALLOCATOR_NEW 1 2025-05-07T20:26:58.4629191Z #define __cpp_lib_math_special_functions 201603L 2025-05-07T20:26:58.4629286Z #define __fsfilcnt_t_defined 2025-05-07T20:26:58.4629376Z #define __blkcnt_t_defined 2025-05-07T20:26:58.4629648Z #define cudaKernelNodeAttributeMemSyncDomain cudaLaunchAttributeMemSyncDomain 2025-05-07T20:26:58.4629740Z #define __USE_LARGEFILE 1 2025-05-07T20:26:58.4629839Z #define __cpp_constexpr 201603L 2025-05-07T20:26:58.4629939Z #define CUDART_VERSION 12060 2025-05-07T20:26:58.4630028Z #define NL_TEXTMAX INT_MAX 2025-05-07T20:26:58.4630129Z #define cudaDeviceMapHost 0x08 2025-05-07T20:26:58.4630227Z #define _GLIBCXX_CMATH 1 2025-05-07T20:26:58.4630419Z #define __attribute_format_arg__(x) __attribute__ ((__format_arg__ (x))) 2025-05-07T20:26:58.4630516Z #define __lldiv_t_defined 1 2025-05-07T20:26:58.4630598Z #define __SSE2__ 1 2025-05-07T20:26:58.4630680Z #define _IOLBF 1 2025-05-07T20:26:58.4630787Z #define _GLIBCXX_HAVE_SYS_TYPES_H 1 2025-05-07T20:26:58.4630894Z #define _GLIBCXX_HAVE_FLOORF 1 2025-05-07T20:26:58.4631002Z #define __cpp_deduction_guides 201703L 2025-05-07T20:26:58.4631104Z #define _GLIBCXX_HAVE_EXPF 1 2025-05-07T20:26:58.4631212Z #define __annotate__(a) __attribute__((a)) 2025-05-07T20:26:58.4631300Z #define __INT32_TYPE__ int 2025-05-07T20:26:58.4631395Z #define __SIZEOF_DOUBLE__ 8 2025-05-07T20:26:58.4631502Z #define cudaDeviceSyncMemops 0x80 2025-05-07T20:26:58.4631606Z #define __cpp_exceptions 199711L 2025-05-07T20:26:58.4631701Z #define __FLT_MIN_10_EXP__ (-37) 2025-05-07T20:26:58.4631813Z #define cudaDeviceScheduleYield 0x02 2025-05-07T20:26:58.4631911Z #define _SYS_SYSMACROS_H 1 2025-05-07T20:26:58.4632026Z #define _GLIBCXX_TR1_LEGENDRE_FUNCTION_TCC 1 2025-05-07T20:26:58.4632184Z #define __FLT64_MIN__ 2.22507385850720138309023271733240406e-308F64 2025-05-07T20:26:58.4632283Z #define __INT_LEAST32_WIDTH__ 32 2025-05-07T20:26:58.4632375Z #define __SWORD_TYPE long int 2025-05-07T20:26:58.4632468Z #define __INTMAX_TYPE__ long int 2025-05-07T20:26:58.4632576Z #define _GLIBCXX11_USE_C99_MATH 1 2025-05-07T20:26:58.4632671Z #define __PTHREAD_SPINS 0, 0 2025-05-07T20:26:58.4632764Z #define _BITS_POSIX1_LIM_H 1 2025-05-07T20:26:58.4633047Z #define cudaStreamAttributeMemSyncDomainMap cudaLaunchAttributeMemSyncDomainMap 2025-05-07T20:26:58.4633138Z #define __DEC128_MAX_EXP__ 6145 2025-05-07T20:26:58.4633288Z #define math_errhandling (MATH_ERRNO | MATH_ERREXCEPT) 2025-05-07T20:26:58.4633369Z #define _T_SIZE 2025-05-07T20:26:58.4633473Z #define cudaHostAllocDefault 0x00 2025-05-07T20:26:58.4633601Z #define 
_PSTL_PRAGMA_SIMD_EXCLUSIVE_SCAN(PRM) 2025-05-07T20:26:58.4633723Z #define __va_arg_pack() __builtin_va_arg_pack () 2025-05-07T20:26:58.4633814Z #define _POSIX_TIMER_MAX 32 2025-05-07T20:26:58.4633910Z #define _GLIBCXX_HAVE_TLS 1 2025-05-07T20:26:58.4634033Z #define _GLIBCXX_NOTHROW _GLIBCXX_USE_NOEXCEPT 2025-05-07T20:26:58.4634124Z #define _GLIBCXX_HAVE_ACOSL 1 2025-05-07T20:26:58.4634229Z #define __FLT32X_HAS_QUIET_NAN__ 1 2025-05-07T20:26:58.4634406Z #define __ATOMIC_CONSUME 1 2025-05-07T20:26:58.4634673Z #define __CUDA_ARCH_HAS_FEATURE__(_FEAT) __CUDA_ARCH_FEAT_ ##_FEAT 2025-05-07T20:26:58.4634762Z #define __GNUC_MINOR__ 4 2025-05-07T20:26:58.4634864Z #define __GLIBCXX_TYPE_INT_N_0 __int128 2025-05-07T20:26:58.4634965Z #define __INT_FAST16_WIDTH__ 64 2025-05-07T20:26:58.4635081Z #define __UINTMAX_MAX__ 0xffffffffffffffffUL 2025-05-07T20:26:58.4635162Z #define __PIE__ 2 2025-05-07T20:26:58.4635273Z #define LITTLE_ENDIAN __LITTLE_ENDIAN 2025-05-07T20:26:58.4635373Z #define _GLIBCXX_HAVE_INT64_T_LONG 1 2025-05-07T20:26:58.4635560Z #define __FLT32X_DENORM_MIN__ 4.94065645841246544176568792868221372e-324F32x 2025-05-07T20:26:58.4635784Z #define __intN_t(N,MODE) typedef int int ##N ##_t __attribute__ ((__mode__ (MODE))) 2025-05-07T20:26:58.4635876Z #define __nlink_t_defined 2025-05-07T20:26:58.4636009Z #define _GLIBCXX17_DEPRECATED [[__deprecated__]] 2025-05-07T20:26:58.4636121Z #define _PSTL_STRING(x) _PSTL_STRING_AUX(x) 2025-05-07T20:26:58.4636213Z #define _XOPEN_LIM_H 1 2025-05-07T20:26:58.4636482Z #define __u_intN_t(N,MODE) typedef unsigned int u_int ##N ##_t __attribute__ ((__mode__ (MODE))) 2025-05-07T20:26:58.4636599Z #define __cpp_template_template_args 201611L 2025-05-07T20:26:58.4636705Z #define _GTHREAD_USE_MUTEX_TIMEDLOCK 1 2025-05-07T20:26:58.4636812Z #define BC_DIM_MAX _POSIX2_BC_DIM_MAX 2025-05-07T20:26:58.4636911Z #define __DBL_MAX_10_EXP__ 308 2025-05-07T20:26:58.4637001Z #define __FILE_defined 1 2025-05-07T20:26:58.4637180Z #define __LDBL_DENORM_MIN__ 3.64519953188247460252840593361941982e-4951L 2025-05-07T20:26:58.4637278Z #define _GLIBCXX_HAVE_SINCOS 1 2025-05-07T20:26:58.4637373Z #define __USE_XOPEN_EXTENDED 1 2025-05-07T20:26:58.4637491Z #define __cpp_lib_tuple_element_t 201402L 2025-05-07T20:26:58.4637606Z #define isascii_l(c,l) __isascii_l ((c), (l)) 2025-05-07T20:26:58.4637725Z #define cudaInvalidDeviceId ((int)-2) 2025-05-07T20:26:58.4637828Z #define _GLIBCXX_HAVE_SYS_RESOURCE_H 1 2025-05-07T20:26:58.4637917Z #define __INT16_C(c) c 2025-05-07T20:26:58.4638027Z #define __U32_TYPE unsigned int 2025-05-07T20:26:58.4638126Z #define _GLIBCXX_HAVE_SYS_IOCTL_H 1 2025-05-07T20:26:58.4638246Z #define FD_CLR(fd,fdsetp) __FD_CLR (fd, fdsetp) 2025-05-07T20:26:58.4638333Z #define __STDC__ 1 2025-05-07T20:26:58.4638428Z #define _GLIBCXX_HAVE_VWSCANF 1 2025-05-07T20:26:58.4638527Z #define _GLIBCXX_HAVE_EXECINFO_H 1 2025-05-07T20:26:58.4638628Z #define _GLIBCXX_USE_REALPATH 1 2025-05-07T20:26:58.4638775Z #define __attribute_malloc__ __attribute__ ((__malloc__)) 2025-05-07T20:26:58.4638869Z #define __FLT32X_DIG__ 15 2025-05-07T20:26:58.4638968Z #define _GLIBCXX_USE_C99_CTYPE_TR1 1 2025-05-07T20:26:58.4639064Z #define __PTRDIFF_TYPE__ long int 2025-05-07T20:26:58.4639181Z #define cudaArrayDeferredMapping 0x80 2025-05-07T20:26:58.4639290Z #define _GLIBCXX_END_NAMESPACE_CONTAINER 2025-05-07T20:26:58.4639388Z #define USHRT_MAX (SHRT_MAX * 2 + 1) 2025-05-07T20:26:58.4639496Z #define __cpp_lib_is_swappable 201603 2025-05-07T20:26:58.4639583Z #define stdin stdin 
2025-05-07T20:26:58.4639675Z #define __ino64_t_defined 2025-05-07T20:26:58.4639768Z #define STA_CLK 0x8000 2025-05-07T20:26:58.4639862Z #define __clockid_t_defined 1 2025-05-07T20:26:58.4640005Z #define _GLIBCXX_NOEXCEPT_IF(...) noexcept(__VA_ARGS__) 2025-05-07T20:26:58.4640172Z #define __attribute_noinline__ __attribute__ ((__noinline__)) 2025-05-07T20:26:58.4640274Z #define __cudaCDP2MemsetAsync 2025-05-07T20:26:58.4640382Z #define _PSTL_PRAGMA_SIMD_SCAN(PRM) 2025-05-07T20:26:58.4640486Z #define _GLIBCXX_BEGIN_NAMESPACE_LDBL 2025-05-07T20:26:58.4640589Z #define _GLIBCXX_TR1_POLY_HERMITE_TCC 1 2025-05-07T20:26:58.4640790Z #define __FD_SET(d,set) ((void) (__FDS_BITS (set)[__FD_ELT (d)] |= __FD_MASK (d))) 2025-05-07T20:26:58.4640883Z #define __ATOMIC_SEQ_CST 5 2025-05-07T20:26:58.4641487Z #define __tobody(c,f,a,args) (__extension__ ({ int __res; if (sizeof (c) > 1) { if (__builtin_constant_p (c)) { int __c = (c); __res = __c < -128 || __c > 255 ? __c : (a)[__c]; } else __res = f args; } else __res = (a)[(int) (c)]; __res; })) 2025-05-07T20:26:58.4641671Z #define DOMAIN 1 2025-05-07T20:26:58.4641764Z #define M_LN2 0.69314718055994530942 2025-05-07T20:26:58.4641855Z #define __NVCC__ 1 2025-05-07T20:26:58.4641958Z #define __cudaCDP2Memset2DAsync 2025-05-07T20:26:58.4642069Z #define __CLOCK_T_TYPE __SYSCALL_SLONG_TYPE 2025-05-07T20:26:58.4642177Z #define _PSTL_PRAGMA_SIMD_EARLYEXIT 2025-05-07T20:26:58.4642281Z #define __throw_exception_again throw 2025-05-07T20:26:58.4642376Z #define M_SQRT2 1.41421356237309504880 2025-05-07T20:26:58.4642474Z #define __EXCEPTION_H 1 2025-05-07T20:26:58.4642571Z #define __FLT32X_MIN_10_EXP__ (-307) 2025-05-07T20:26:58.4642672Z #define HUGE_VAL (__builtin_huge_val()) 2025-05-07T20:26:58.4642974Z #define cudaStreamAttributeAccessPolicyWindow cudaLaunchAttributeAccessPolicyWindow 2025-05-07T20:26:58.4643084Z #define __UINTPTR_TYPE__ long unsigned int 2025-05-07T20:26:58.4643188Z #define _GLIBCXX_INLINE_VERSION 0 2025-05-07T20:26:58.4643283Z #define _GLIBCXX_USE_INT128 1 2025-05-07T20:26:58.4643399Z #define __cpp_lib_bool_constant 201505 2025-05-07T20:26:58.4643500Z #define PTHREAD_KEYS_MAX 1024 2025-05-07T20:26:58.4643642Z #define __DEC64_SUBNORMAL_MIN__ 0.000000000000001E-383DD 2025-05-07T20:26:58.4643748Z #define __FSFILCNT64_T_TYPE __UQUAD_TYPE 2025-05-07T20:26:58.4643863Z #define _GLIBCXX_DOUBLE_IS_IEEE_BINARY64 1 2025-05-07T20:26:58.4643956Z #define __DEC128_MANT_DIG__ 34 2025-05-07T20:26:58.4644060Z #define __cpp_lib_tuples_by_type 201304 2025-05-07T20:26:58.4644163Z #define __LDBL_MIN_10_EXP__ (-4931) 2025-05-07T20:26:58.4644264Z #define __cpp_generic_lambdas 201304L 2025-05-07T20:26:58.4644397Z #define _GLIBCXX_THROW_OR_ABORT(_EXC) (throw (_EXC)) 2025-05-07T20:26:58.4644500Z #define __useconds_t_defined 2025-05-07T20:26:58.4644599Z #define _GLIBCXX_USE_SCHED_YIELD 1 2025-05-07T20:26:58.4644783Z #define __attribute_deprecated__ __attribute__ ((__deprecated__)) 2025-05-07T20:26:58.4644927Z #define __cpp_lib_type_trait_variable_templates 201510L 2025-05-07T20:26:58.4645016Z #define __SSE_MATH__ 1 2025-05-07T20:26:58.4645118Z #define _IO_wint_t wint_t 2025-05-07T20:26:58.4645212Z #define __SIZEOF_LONG_LONG__ 8 2025-05-07T20:26:58.4645303Z #define _GLIBCXX_VERBOSE 1 2025-05-07T20:26:58.4645406Z #define _GLIBCXX_HAVE_ASINF 1 2025-05-07T20:26:58.4645520Z #define __cpp_user_defined_literals 200809L 2025-05-07T20:26:58.4645616Z #define _GLIBCXX_HAVE_ISINFL 1 2025-05-07T20:26:58.4645718Z #define _GLIBCXX_HAVE_ASINL 1 2025-05-07T20:26:58.4645805Z #define __USE_ATFILE 
1 2025-05-07T20:26:58.4645906Z #define _POSIX_OPEN_MAX 20 2025-05-07T20:26:58.4646002Z #define _POSIX_LOGIN_NAME_MAX 9 2025-05-07T20:26:58.4646090Z #define _GCC_PTRDIFF_T 2025-05-07T20:26:58.4646318Z #define cudaKernelNodeAttributePriority cudaLaunchAttributePriority 2025-05-07T20:26:58.4646415Z #define __FLT128_DECIMAL_DIG__ 36 2025-05-07T20:26:58.4646514Z #define _POSIX_THREAD_KEYS_MAX 128 2025-05-07T20:26:58.4646622Z #define __GCC_ATOMIC_LLONG_LOCK_FREE 2 2025-05-07T20:26:58.4646735Z #define __cpp_lib_array_constexpr 201803L 2025-05-07T20:26:58.4646822Z #define _STDLIB_H 1 2025-05-07T20:26:58.4646968Z #define __exctype(name) extern int name (int) __THROW 2025-05-07T20:26:58.4647064Z #define __FLT32_HAS_QUIET_NAN__ 1 2025-05-07T20:26:58.4647157Z #define __FLT_DECIMAL_DIG__ 9 2025-05-07T20:26:58.4647290Z #define __UINT_FAST16_MAX__ 0xffffffffffffffffUL 2025-05-07T20:26:58.4647399Z #define __SURFACE_INDIRECT_FUNCTIONS_H__ 2025-05-07T20:26:58.4647499Z #define __SM_61_INTRINSICS_H__ 2025-05-07T20:26:58.4647680Z #define _GLIBCXX_PACKAGE_STRING "package-unused version-unused" 2025-05-07T20:26:58.4647831Z #define __isxdigit_l(c,l) __isctype_l((c), _ISxdigit, (l)) 2025-05-07T20:26:58.4647945Z #define __glibcxx_requires_nonempty() 2025-05-07T20:26:58.4648061Z #define w_stopsig __wait_stopped.__w_stopsig 2025-05-07T20:26:58.4648152Z #define __ldiv_t_defined 1 2025-05-07T20:26:58.4648336Z #define __glibcxx_requires_irreflexive_pred(_First,_Last,_Pred) 2025-05-07T20:26:58.4648428Z #define ___int_ptrdiff_t_h 2025-05-07T20:26:58.4648688Z #define __LDBL_NORM_MAX__ 1.18973149535723176502126385303097021e+4932L 2025-05-07T20:26:58.4648877Z #define __cudaCDP2EventDestroy 2025-05-07T20:26:58.4648971Z #define __HOST_DEFINES_H__ 2025-05-07T20:26:58.4649080Z #define __GCC_ATOMIC_SHORT_LOCK_FREE 2 2025-05-07T20:26:58.4649182Z #define __SM_20_ATOMIC_FUNCTIONS_H__ 2025-05-07T20:26:58.4649279Z #define _GLIBCXX_USE_NANOSLEEP 1 2025-05-07T20:26:58.4649368Z #define CUDART_CB 2025-05-07T20:26:58.4649469Z #define BC_BASE_MAX _POSIX2_BC_BASE_MAX 2025-05-07T20:26:58.4649592Z #define _GLIBCXX_USE_C99_INTTYPES_WCHAR_T_TR1 1 2025-05-07T20:26:58.4649686Z #define MB_LEN_MAX 16 2025-05-07T20:26:58.4649904Z #define __glibcxx_requires_partitioned_lower_pred(_First,_Last,_Value,_Pred) 2025-05-07T20:26:58.4650001Z #define _GLIBCXX11_USE_C99_WCHAR 1 2025-05-07T20:26:58.4650128Z #define _IO_peekc(_fp) _IO_peekc_unlocked (_fp) 2025-05-07T20:26:58.4650240Z #define _GLIBCXX_HAVE_AS_SYMVER_DIRECTIVE 1 2025-05-07T20:26:58.4650344Z #define _GLIBCXX_HAVE_UNISTD_H 1 2025-05-07T20:26:58.4650500Z #define __glibc_likely(cond) __builtin_expect((cond), 1) 2025-05-07T20:26:58.4650608Z #define __UINT_FAST8_TYPE__ unsigned char 2025-05-07T20:26:58.4650699Z #define _GNU_SOURCE 1 2025-05-07T20:26:58.4650785Z #define __stub_putmsg 2025-05-07T20:26:58.4650868Z #define __CUDACC__ 1 2025-05-07T20:26:58.4650962Z #define __N(msgid) (msgid) 2025-05-07T20:26:58.4651047Z #define __P(args) args 2025-05-07T20:26:58.4651293Z #define cudaKernelNodeAttributeCooperative cudaLaunchAttributeCooperative 2025-05-07T20:26:58.4651402Z #define __cpp_init_captures 201304L 2025-05-07T20:26:58.4651506Z #define _GLIBCXX17_CONSTEXPR constexpr 2025-05-07T20:26:58.4651596Z #define __ATOMIC_ACQ_REL 4 2025-05-07T20:26:58.4651701Z #define __cpp_lib_as_const 201510 2025-05-07T20:26:58.4651782Z #define __WCHAR_T 2025-05-07T20:26:58.4651879Z #define __ATOMIC_RELEASE 3 2025-05-07T20:26:58.4651973Z #define __fsblkcnt_t_defined 2025-05-07T20:26:58.4652091Z #define 
__cudaCDP2EventCreateWithFlags 2025-05-07T20:26:58.4652209Z #define __DEVICE_DOUBLE_FUNCTIONS_H__ 2025-05-07T20:26:58.4652221Z 2025-05-07T20:26:58.4821089Z 2025-05-07T20:26:58.4821528Z + conda run -n build_binary nvcc --version 2025-05-07T20:26:58.4821567Z 2025-05-07T20:27:00.3847252Z nvcc: NVIDIA (R) Cuda compiler driver 2025-05-07T20:27:00.3847645Z Copyright (c) 2005-2024 NVIDIA Corporation 2025-05-07T20:27:00.3847963Z Built on Tue_Oct_29_23:50:19_PDT_2024 2025-05-07T20:27:00.3848266Z Cuda compilation tools, release 12.6, V12.6.85 2025-05-07T20:27:00.3848593Z Build cuda_12.6.r12.6/compiler.35059454_0 2025-05-07T20:27:00.3848796Z 2025-05-07T20:27:00.4481489Z 2025-05-07T20:27:00.4492922Z /usr/bin/nvidia-smi 2025-05-07T20:27:00.4498532Z + nvidia-smi 2025-05-07T20:27:00.4498669Z 2025-05-07T20:27:00.4672343Z Wed May 7 20:27:00 2025 2025-05-07T20:27:00.4673228Z +-----------------------------------------------------------------------------------------+ 2025-05-07T20:27:00.4674298Z | NVIDIA-SMI 570.133.07 Driver Version: 570.133.07 CUDA Version: 12.8 | 2025-05-07T20:27:00.4675257Z |-----------------------------------------+------------------------+----------------------+ 2025-05-07T20:27:00.4676384Z | GPU Name Persistence-M | Bus-Id Disp.A | Volatile Uncorr. ECC | 2025-05-07T20:27:00.4677400Z | Fan Temp Perf Pwr:Usage/Cap | Memory-Usage | GPU-Util Compute M. | 2025-05-07T20:27:00.4678232Z | | | MIG M. | 2025-05-07T20:27:00.4678876Z |=========================================+========================+======================| 2025-05-07T20:27:00.4843068Z | 0 NVIDIA A10G On | 00000000:00:1E.0 Off | 0 | 2025-05-07T20:27:00.4843597Z | 0% 29C P8 24W / 300W | 0MiB / 23028MiB | 0% Default | 2025-05-07T20:27:00.4843983Z | | | N/A | 2025-05-07T20:27:00.4848155Z +-----------------------------------------+------------------------+----------------------+ 2025-05-07T20:27:00.4848845Z 2025-05-07T20:27:00.4849249Z +-----------------------------------------------------------------------------------------+ 2025-05-07T20:27:00.4849668Z | Processes: | 2025-05-07T20:27:00.4850100Z | GPU GI CI PID Type Process name GPU Memory | 2025-05-07T20:27:00.4850501Z | ID ID Usage | 2025-05-07T20:27:00.4850840Z |=========================================================================================| 2025-05-07T20:27:00.4852728Z | No running processes found | 2025-05-07T20:27:00.4853382Z +-----------------------------------------------------------------------------------------+ 2025-05-07T20:27:00.7421165Z 2025-05-07T20:27:00.7425734Z [INSTALL] Successfully installed CUDA 12.6.3 2025-05-07T20:27:00.7474903Z ##[group]Run . $PRELUDE; install_pytorch_pip $BUILD_ENV nightly cuda/12.6.3 2025-05-07T20:27:00.7475436Z . 
$PRELUDE; install_pytorch_pip $BUILD_ENV nightly cuda/12.6.3 2025-05-07T20:27:00.7489101Z shell: /usr/bin/bash --noprofile --norc -e -o pipefail {0} 2025-05-07T20:27:00.7489434Z env: 2025-05-07T20:27:00.7489651Z PRELUDE: .github/scripts/setup_env.bash 2025-05-07T20:27:00.7489935Z BUILD_ENV: build_binary 2025-05-07T20:27:00.7490171Z BUILD_TARGET: genai 2025-05-07T20:27:00.7490391Z BUILD_VARIANT: cuda 2025-05-07T20:27:00.7490613Z BUILD_CUDA_VERSION: 12.6.3 2025-05-07T20:27:00.7490860Z ENFORCE_CUDA_DEVICE: 1 2025-05-07T20:27:00.7491156Z GPU_FLAG: --gpus all -e NVIDIA_DRIVER_CAPABILITIES=all 2025-05-07T20:27:00.7491483Z ##[endgroup] 2025-05-07T20:27:01.0848791Z ################################################################################ 2025-05-07T20:27:01.0849186Z # Install PyTorch (PIP) 2025-05-07T20:27:01.0849414Z # 2025-05-07T20:27:01.0864259Z # [2025-05-07T20:27:01.086Z] + install_pytorch_pip build_binary nightly cuda/12.6.3 2025-05-07T20:27:01.0864936Z ################################################################################ 2025-05-07T20:27:01.0865293Z 2025-05-07T20:27:01.0893053Z [EXEC] [ATTEMPT 0/3] + conda install -n build_binary -c conda-forge --override-channels -y numpy 2025-05-07T20:27:02.0771703Z Channels: 2025-05-07T20:27:02.0771946Z - conda-forge 2025-05-07T20:27:02.0772180Z Platform: linux-64 2025-05-07T20:27:05.3889934Z Collecting package metadata (repodata.json): - \ | / done 2025-05-07T20:27:06.1104906Z Solving environment: \ | / done 2025-05-07T20:27:06.3320013Z 2025-05-07T20:27:06.3320650Z ## Package Plan ## 2025-05-07T20:27:06.3320914Z 2025-05-07T20:27:06.3321192Z environment location: /home/ec2-user/miniconda/envs/build_binary 2025-05-07T20:27:06.3321569Z 2025-05-07T20:27:06.3321707Z added / updated specs: 2025-05-07T20:27:06.3321965Z - numpy 2025-05-07T20:27:06.3322096Z 2025-05-07T20:27:06.3322112Z 2025-05-07T20:27:06.3322241Z The following packages will be downloaded: 2025-05-07T20:27:06.3322485Z 2025-05-07T20:27:06.3322613Z package | build 2025-05-07T20:27:06.3322956Z ---------------------------|----------------- 2025-05-07T20:27:06.3323430Z libblas-3.9.0 |31_h59b9bed_openblas 16 KB conda-forge 2025-05-07T20:27:06.3324066Z libcblas-3.9.0 |31_he106b2a_openblas 16 KB conda-forge 2025-05-07T20:27:06.3324673Z libgfortran-15.1.0 | h69a702a_2 34 KB conda-forge 2025-05-07T20:27:06.3325111Z libgfortran5-15.1.0 | hcea5267_2 1.5 MB conda-forge 2025-05-07T20:27:06.3325560Z liblapack-3.9.0 |31_h7ac8fdf_openblas 16 KB conda-forge 2025-05-07T20:27:06.3326026Z libopenblas-0.3.29 |pthreads_h94d23a6_0 5.6 MB conda-forge 2025-05-07T20:27:06.3326836Z numpy-2.2.5 | py313h17eae1a_0 8.1 MB conda-forge 2025-05-07T20:27:06.3327216Z ------------------------------------------------------------ 2025-05-07T20:27:06.3327554Z Total: 15.4 MB 2025-05-07T20:27:06.3327761Z 2025-05-07T20:27:06.3327895Z The following NEW packages will be INSTALLED: 2025-05-07T20:27:06.3328108Z 2025-05-07T20:27:06.3328329Z libblas conda-forge/linux-64::libblas-3.9.0-31_h59b9bed_openblas 2025-05-07T20:27:06.3328826Z libcblas conda-forge/linux-64::libcblas-3.9.0-31_he106b2a_openblas 2025-05-07T20:27:06.3329322Z libgfortran conda-forge/linux-64::libgfortran-15.1.0-h69a702a_2 2025-05-07T20:27:06.3329822Z libgfortran5 conda-forge/linux-64::libgfortran5-15.1.0-hcea5267_2 2025-05-07T20:27:06.3330344Z liblapack conda-forge/linux-64::liblapack-3.9.0-31_h7ac8fdf_openblas 2025-05-07T20:27:06.3330870Z libopenblas conda-forge/linux-64::libopenblas-0.3.29-pthreads_h94d23a6_0 2025-05-07T20:27:06.3331813Z numpy 
conda-forge/linux-64::numpy-2.2.5-py313h17eae1a_0
2025-05-07T20:27:06.3332625Z Downloading and Extracting Packages: ...working...
2025-05-07T20:27:06.4984120Z libblas-3.9.0 | 16 KB | ########## | 100%
2025-05-07T20:27:06.5955733Z libgfortran-15.1.0 | 34 KB | ########## | 100%
2025-05-07T20:27:06.6290504Z libcblas-3.9.0 | 16 KB | ########## | 100%
2025-05-07T20:27:06.6433619Z liblapack-3.9.0 | 16 KB | ########## | 100%
2025-05-07T20:27:06.6914618Z libgfortran5-15.1.0 | 1.5 MB | ########## | 100%
2025-05-07T20:27:06.7412966Z libopenblas-0.3.29 | 5.6 MB | ########## | 100%
2025-05-07T20:27:07.2352837Z numpy-2.2.5 | 8.1 MB | ########## | 100%
2025-05-07T20:27:07.2362888Z done
2025-05-07T20:27:07.3368360Z Preparing transaction: done
2025-05-07T20:27:07.5377329Z Verifying transaction: done
2025-05-07T20:27:07.6384949Z Executing transaction: done
2025-05-07T20:27:07.8175141Z ################################################################################
2025-05-07T20:27:07.8175678Z # Install Package From PyTorch PIP: torch
2025-05-07T20:27:07.8176072Z #
2025-05-07T20:27:07.8190183Z # [2025-05-07T20:27:07.818Z] + install_from_pytorch_pip build_binary torch nightly cuda/12.6.3
2025-05-07T20:27:07.8190814Z ################################################################################
2025-05-07T20:27:07.8205920Z [EXEC] [ATTEMPT 0/3] + wget -q --timeout 1 pypi.org -O /dev/null
2025-05-07T20:27:07.9124477Z [CHECK] Network does not appear to be blocked. 
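For reference, a minimal bash sketch of the retried network probe that precedes each PIP step above. The function name check_network, the retry count, and the back-off are assumptions inferred from the [EXEC] [ATTEMPT 0/3] lines; the actual implementation lives in .github/scripts/setup_env.bash.

# Probe pypi.org up to 3 times; any success means the network is usable.
check_network() {
  local attempt
  for attempt in 0 1 2; do
    echo "[EXEC] [ATTEMPT ${attempt}/3] + wget -q --timeout 1 pypi.org -O /dev/null"
    if wget -q --timeout 1 pypi.org -O /dev/null; then
      echo "[CHECK] Network does not appear to be blocked."
      return 0
    fi
    sleep 1  # assumed back-off between attempts
  done
  echo "[CHECK] Network appears to be blocked." >&2
  return 1
}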
2025-05-07T20:27:07.9125422Z ################################################################################ 2025-05-07T20:27:07.9126297Z # Prepare PIP Arguments (PyTorch PIP) 2025-05-07T20:27:07.9126950Z # 2025-05-07T20:27:07.9141619Z # [2025-05-07T20:27:07.913Z] + __prepare_pip_arguments torch nightly cuda/12.6.3 2025-05-07T20:27:07.9142487Z ################################################################################ 2025-05-07T20:27:07.9142754Z 2025-05-07T20:27:07.9162979Z [INSTALL] Extracted package (channel, version): (nightly, LATEST) 2025-05-07T20:27:07.9190438Z [INSTALL] Extracted package variant: cu126 2025-05-07T20:27:07.9207615Z [INSTALL] Using a non-RELEASE channel: nightly ... 2025-05-07T20:27:07.9208150Z [INSTALL] Extracted the full PIP channel: https://download.pytorch.org/whl/nightly/cu126/ 2025-05-07T20:27:07.9216584Z [INSTALL] Extracted the full PIP package: --pre torch 2025-05-07T20:27:07.9225758Z [INSTALL] Attempting to install [torch, LATEST] from PyTorch PIP using channel https://download.pytorch.org/whl/nightly/cu126/ ... 2025-05-07T20:27:07.9248439Z [EXEC] [ATTEMPT 0/3] + conda run -n build_binary pip install --pre torch --index-url https://download.pytorch.org/whl/nightly/cu126/ 2025-05-07T20:28:24.5354597Z DEPRECATION: Building 'MarkupSafe' using the legacy setup.py bdist_wheel mechanism, which will be removed in a future version. pip 25.3 will enforce this behaviour change. A possible replacement is to use the standardized build interface by setting the `--use-pep517` option, (possibly combined with `--no-build-isolation`), or adding a `pyproject.toml` file to the source tree of 'MarkupSafe'. Discussion can be found at https://github.com/pypa/pip/issues/6334 2025-05-07T20:28:24.5356870Z Looking in indexes: https://download.pytorch.org/whl/nightly/cu126/ 2025-05-07T20:28:24.5357253Z Collecting torch 2025-05-07T20:28:24.5357893Z Downloading https://download.pytorch.org/whl/nightly/cu126/torch-2.8.0.dev20250507%2Bcu126-cp313-cp313-manylinux_2_28_x86_64.whl.metadata (30 kB) 2025-05-07T20:28:24.5358582Z Collecting filelock (from torch) 2025-05-07T20:28:24.5359068Z Using cached https://download.pytorch.org/whl/nightly/filelock-3.16.1-py3-none-any.whl (16 kB) 2025-05-07T20:28:24.5359971Z Requirement already satisfied: typing-extensions>=4.10.0 in /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages (from torch) (4.13.2) 2025-05-07T20:28:24.5361020Z Requirement already satisfied: setuptools in /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages (from torch) (78.1.1) 2025-05-07T20:28:24.5361668Z Collecting sympy>=1.13.3 (from torch) 2025-05-07T20:28:24.5362152Z Using cached https://download.pytorch.org/whl/nightly/sympy-1.13.3-py3-none-any.whl (6.2 MB) 2025-05-07T20:28:24.5362647Z Collecting networkx (from torch) 2025-05-07T20:28:24.5363132Z Downloading https://download.pytorch.org/whl/nightly/networkx-3.4.2-py3-none-any.whl (1.7 MB) 2025-05-07T20:28:24.5365843Z ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 1.7/1.7 MB 20.4 MB/s eta 0:00:00 2025-05-07T20:28:24.5366198Z Collecting jinja2 (from torch) 2025-05-07T20:28:24.5366674Z Using cached https://download.pytorch.org/whl/nightly/jinja2-3.1.4-py3-none-any.whl (133 kB) 2025-05-07T20:28:24.5367168Z Collecting fsspec (from torch) 2025-05-07T20:28:24.5368106Z Using cached https://download.pytorch.org/whl/nightly/fsspec-2024.10.0-py3-none-any.whl (179 kB) 2025-05-07T20:28:24.5368548Z 2025-05-07T20:28:24.5368707Z Collecting nvidia-cuda-nvrtc-cu12==12.6.77 (from torch) 
2025-05-07T20:28:24.5369406Z Downloading https://download.pytorch.org/whl/nightly/cu126/nvidia_cuda_nvrtc_cu12-12.6.77-py3-none-manylinux2014_x86_64.whl (23.7 MB) 2025-05-07T20:28:24.5370211Z ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 23.7/23.7 MB 58.2 MB/s eta 0:00:00 2025-05-07T20:28:24.5370623Z Collecting nvidia-cuda-runtime-cu12==12.6.77 (from torch) 2025-05-07T20:28:24.5371319Z Downloading https://download.pytorch.org/whl/nightly/cu126/nvidia_cuda_runtime_cu12-12.6.77-py3-none-manylinux2014_x86_64.whl (897 kB) 2025-05-07T20:28:24.5372090Z ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 897.7/897.7 kB 11.2 MB/s eta 0:00:00 2025-05-07T20:28:24.5372480Z Collecting nvidia-cuda-cupti-cu12==12.6.80 (from torch) 2025-05-07T20:28:24.5373164Z Downloading https://download.pytorch.org/whl/nightly/cu126/nvidia_cuda_cupti_cu12-12.6.80-py3-none-manylinux2014_x86_64.whl (8.9 MB) 2025-05-07T20:28:24.5374230Z ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 8.9/8.9 MB 48.6 MB/s eta 0:00:00 2025-05-07T20:28:24.5374605Z Collecting nvidia-cudnn-cu12==9.5.1.17 (from torch) 2025-05-07T20:28:24.5375276Z Downloading https://download.pytorch.org/whl/nightly/cu126/nvidia_cudnn_cu12-9.5.1.17-py3-none-manylinux_2_28_x86_64.whl (571.0 MB) 2025-05-07T20:28:24.5376030Z ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 571.0/571.0 MB 57.4 MB/s eta 0:00:00 2025-05-07T20:28:24.5376402Z Collecting nvidia-cublas-cu12==12.6.4.1 (from torch) 2025-05-07T20:28:24.5377147Z Downloading https://download.pytorch.org/whl/nightly/cu126/nvidia_cublas_cu12-12.6.4.1-py3-none-manylinux2014_x86_64.manylinux_2_17_x86_64.whl (393.1 MB) 2025-05-07T20:28:24.5377975Z ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 393.1/393.1 MB 85.5 MB/s eta 0:00:00 2025-05-07T20:28:24.5378336Z Collecting nvidia-cufft-cu12==11.3.0.4 (from torch) 2025-05-07T20:28:24.5379020Z Downloading https://download.pytorch.org/whl/nightly/cu126/nvidia_cufft_cu12-11.3.0.4-py3-none-manylinux2014_x86_64.whl (200.2 MB) 2025-05-07T20:28:24.5379766Z ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 200.2/200.2 MB 123.6 MB/s eta 0:00:00 2025-05-07T20:28:24.5380133Z Collecting nvidia-curand-cu12==10.3.7.77 (from torch) 2025-05-07T20:28:24.5380795Z Downloading https://download.pytorch.org/whl/nightly/cu126/nvidia_curand_cu12-10.3.7.77-py3-none-manylinux2014_x86_64.whl (56.3 MB) 2025-05-07T20:28:24.5381544Z ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 56.3/56.3 MB 214.4 MB/s eta 0:00:00 2025-05-07T20:28:24.5381926Z Collecting nvidia-cusolver-cu12==11.7.1.2 (from torch) 2025-05-07T20:28:24.5382603Z Downloading https://download.pytorch.org/whl/nightly/cu126/nvidia_cusolver_cu12-11.7.1.2-py3-none-manylinux2014_x86_64.whl (158.2 MB) 2025-05-07T20:28:24.5383360Z ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 158.2/158.2 MB 148.2 MB/s eta 0:00:00 2025-05-07T20:28:24.5383738Z Collecting nvidia-cusparse-cu12==12.5.4.2 (from torch) 2025-05-07T20:28:24.5384441Z Downloading https://download.pytorch.org/whl/nightly/cu126/nvidia_cusparse_cu12-12.5.4.2-py3-none-manylinux2014_x86_64.whl (216.6 MB) 2025-05-07T20:28:24.5385195Z ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 216.6/216.6 MB 143.7 MB/s eta 0:00:00 2025-05-07T20:28:24.5385582Z Collecting nvidia-cusparselt-cu12==0.6.3 (from torch) 2025-05-07T20:28:24.5386265Z Downloading https://download.pytorch.org/whl/nightly/cu126/nvidia_cusparselt_cu12-0.6.3-py3-none-manylinux2014_x86_64.whl (156.8 MB) 2025-05-07T20:28:24.5387024Z ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 156.8/156.8 MB 157.4 MB/s eta 0:00:00 2025-05-07T20:28:24.5387382Z Collecting nvidia-nccl-cu12==2.26.2 (from torch) 2025-05-07T20:28:24.5388126Z 
Downloading https://download.pytorch.org/whl/nightly/cu126/nvidia_nccl_cu12-2.26.2-py3-none-manylinux2014_x86_64.manylinux_2_17_x86_64.whl.metadata (2.0 kB) 2025-05-07T20:28:24.5388876Z Collecting nvidia-nvtx-cu12==12.6.77 (from torch) 2025-05-07T20:28:24.5389612Z Downloading https://download.pytorch.org/whl/nightly/cu126/nvidia_nvtx_cu12-12.6.77-py3-none-manylinux2014_x86_64.whl (89 kB) 2025-05-07T20:28:24.5390266Z Collecting nvidia-nvjitlink-cu12==12.6.85 (from torch) 2025-05-07T20:28:24.5391022Z Downloading https://download.pytorch.org/whl/nightly/cu126/nvidia_nvjitlink_cu12-12.6.85-py3-none-manylinux2010_x86_64.manylinux_2_12_x86_64.whl (19.7 MB) 2025-05-07T20:28:24.5391855Z ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 19.7/19.7 MB 195.7 MB/s eta 0:00:00 2025-05-07T20:28:24.5392224Z Collecting nvidia-cufile-cu12==1.11.1.6 (from torch) 2025-05-07T20:28:24.5392987Z Downloading https://download.pytorch.org/whl/nightly/cu126/nvidia_cufile_cu12-1.11.1.6-py3-none-manylinux2014_x86_64.manylinux_2_17_x86_64.whl.metadata (1.5 kB) 2025-05-07T20:28:24.5393770Z Collecting pytorch-triton==3.3.0+git96316ce5 (from torch) 2025-05-07T20:28:24.5394607Z Downloading https://download.pytorch.org/whl/nightly/pytorch_triton-3.3.0%2Bgit96316ce5-cp313-cp313-manylinux_2_27_x86_64.manylinux_2_28_x86_64.whl.metadata (1.6 kB) 2025-05-07T20:28:24.5395529Z Collecting mpmath<1.4,>=1.1.0 (from sympy>=1.13.3->torch) 2025-05-07T20:28:24.5396074Z Using cached https://download.pytorch.org/whl/nightly/mpmath-1.3.0-py3-none-any.whl (536 kB) 2025-05-07T20:28:24.5396591Z Collecting MarkupSafe>=2.0 (from jinja2->torch) 2025-05-07T20:28:24.5397070Z Downloading https://download.pytorch.org/whl/nightly/MarkupSafe-2.1.5.tar.gz (19 kB) 2025-05-07T20:28:24.5397543Z Preparing metadata (setup.py): started 2025-05-07T20:28:24.5397913Z Preparing metadata (setup.py): finished with status 'done' 2025-05-07T20:28:24.5398966Z Downloading https://download.pytorch.org/whl/nightly/cu126/torch-2.8.0.dev20250507%2Bcu126-cp313-cp313-manylinux_2_28_x86_64.whl (825.4 MB) 2025-05-07T20:28:24.5399758Z ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 825.4/825.4 MB 36.2 MB/s eta 0:00:00 2025-05-07T20:28:24.5400498Z Downloading https://download.pytorch.org/whl/nightly/cu126/nvidia_cufile_cu12-1.11.1.6-py3-none-manylinux2014_x86_64.manylinux_2_17_x86_64.whl (1.1 MB) 2025-05-07T20:28:24.5401335Z ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 1.1/1.1 MB 12.4 MB/s eta 0:00:00 2025-05-07T20:28:24.5402068Z Downloading https://download.pytorch.org/whl/nightly/cu126/nvidia_nccl_cu12-2.26.2-py3-none-manylinux2014_x86_64.manylinux_2_17_x86_64.whl (201.3 MB) 2025-05-07T20:28:24.5402878Z ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 201.3/201.3 MB 94.7 MB/s eta 0:00:00 2025-05-07T20:28:24.5403646Z Downloading https://download.pytorch.org/whl/nightly/pytorch_triton-3.3.0%2Bgit96316ce5-cp313-cp313-manylinux_2_27_x86_64.manylinux_2_28_x86_64.whl (153.5 MB) 2025-05-07T20:28:24.5404493Z ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 153.5/153.5 MB 134.3 MB/s eta 0:00:00 2025-05-07T20:28:24.5404869Z Building wheels for collected packages: MarkupSafe 2025-05-07T20:28:24.5405238Z Building wheel for MarkupSafe (setup.py): started 2025-05-07T20:28:24.5405670Z Building wheel for MarkupSafe (setup.py): finished with status 'done' 2025-05-07T20:28:24.5406695Z Created wheel for MarkupSafe: filename=markupsafe-2.1.5-cp313-cp313-linux_x86_64.whl size=14954 sha256=8642341f746950f07f790b09c3e552393bd8cdf535cdc73dd539cf084cd476d7 2025-05-07T20:28:24.5407683Z Stored in directory: 
/home/ec2-user/.cache/pip/wheels/3a/21/87/28c44597225fd0c28d6ffa365f1c2c9dd0ab763711aa4957c6 2025-05-07T20:28:24.5408251Z Successfully built MarkupSafe 2025-05-07T20:28:24.5409867Z Installing collected packages: nvidia-cusparselt-cu12, mpmath, sympy, pytorch-triton, nvidia-nvtx-cu12, nvidia-nvjitlink-cu12, nvidia-nccl-cu12, nvidia-curand-cu12, nvidia-cufile-cu12, nvidia-cuda-runtime-cu12, nvidia-cuda-nvrtc-cu12, nvidia-cuda-cupti-cu12, nvidia-cublas-cu12, networkx, MarkupSafe, fsspec, filelock, nvidia-cusparse-cu12, nvidia-cufft-cu12, nvidia-cudnn-cu12, jinja2, nvidia-cusolver-cu12, torch 2025-05-07T20:28:24.5411410Z 2025-05-07T20:28:24.5413436Z Successfully installed MarkupSafe-2.1.5 filelock-3.16.1 fsspec-2024.10.0 jinja2-3.1.4 mpmath-1.3.0 networkx-3.4.2 nvidia-cublas-cu12-12.6.4.1 nvidia-cuda-cupti-cu12-12.6.80 nvidia-cuda-nvrtc-cu12-12.6.77 nvidia-cuda-runtime-cu12-12.6.77 nvidia-cudnn-cu12-9.5.1.17 nvidia-cufft-cu12-11.3.0.4 nvidia-cufile-cu12-1.11.1.6 nvidia-curand-cu12-10.3.7.77 nvidia-cusolver-cu12-11.7.1.2 nvidia-cusparse-cu12-12.5.4.2 nvidia-cusparselt-cu12-0.6.3 nvidia-nccl-cu12-2.26.2 nvidia-nvjitlink-cu12-12.6.85 nvidia-nvtx-cu12-12.6.77 pytorch-triton-3.3.0+git96316ce5 sympy-1.13.3 torch-2.8.0.dev20250507+cu126 2025-05-07T20:28:24.5415562Z 2025-05-07T20:28:26.7721946Z torch 2.8.0.dev20250507+cu126 2025-05-07T20:28:26.7724057Z [CHECK] The installed package [torch, nightly/LATEST] is the correct variant (cu126) 2025-05-07T20:28:30.2095658Z [CHECK] Python (sub-)package 'torch.distributed' found ... 2025-05-07T20:28:33.6728396Z [CHECK] NOTE: The installed version is: 2.8.0.dev20250507+cu126 2025-05-07T20:28:33.6728982Z [CHECK] NOTE: Checking _GLIBCXX_USE_CXX11_ABI ... 2025-05-07T20:28:37.0156286Z True 2025-05-07T20:28:37.0156507Z True 2025-05-07T20:28:37.0156638Z 2025-05-07T20:28:37.0781500Z [INSTALL] Successfully installed PyTorch through PyTorch PIP 2025-05-07T20:28:37.0818386Z ##[group]Run if . $PRELUDE && which conda; then collect_pytorch_env_info $BUILD_ENV; fi 2025-05-07T20:28:37.0818984Z if . $PRELUDE && which conda; then collect_pytorch_env_info $BUILD_ENV; fi 2025-05-07T20:28:37.0832286Z shell: /usr/bin/bash --noprofile --norc -e -o pipefail {0} 2025-05-07T20:28:37.0832632Z env: 2025-05-07T20:28:37.0832860Z PRELUDE: .github/scripts/setup_env.bash 2025-05-07T20:28:37.0833154Z BUILD_ENV: build_binary 2025-05-07T20:28:37.0833400Z BUILD_TARGET: genai 2025-05-07T20:28:37.0833626Z BUILD_VARIANT: cuda 2025-05-07T20:28:37.0833864Z BUILD_CUDA_VERSION: 12.6.3 2025-05-07T20:28:37.0834112Z ENFORCE_CUDA_DEVICE: 1 2025-05-07T20:28:37.0834412Z GPU_FLAG: --gpus all -e NVIDIA_DRIVER_CAPABILITIES=all 2025-05-07T20:28:37.0834743Z ##[endgroup] 2025-05-07T20:28:37.4190292Z /home/ec2-user/miniconda/bin/conda 2025-05-07T20:28:37.4192310Z ################################################################################ 2025-05-07T20:28:37.4192796Z # Collect PyTorch Environment Information (for Reporting Issues) 2025-05-07T20:28:37.4193158Z # 2025-05-07T20:28:37.4209041Z # [2025-05-07T20:28:37.420Z] + collect_pytorch_env_info build_binary 2025-05-07T20:28:37.4209432Z ################################################################################ 2025-05-07T20:28:37.4209649Z 2025-05-07T20:28:37.4226594Z [EXEC] [ATTEMPT 0/3] + wget -q --timeout 1 pypi.org -O /dev/null 2025-05-07T20:28:37.5125549Z [CHECK] Network does not appear to be blocked. 2025-05-07T20:28:37.5135786Z [INFO] Downloading the PyTorch environment info collection script ... 
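[NOTE] The variant and ABI checks above, and the environment collection that follows, can be reproduced by hand. A minimal sketch, assuming the same build_binary conda environment and network access:

    # Re-check the installed torch variant and the C++11 ABI flag (mirrors the [CHECK] lines above)
    conda run -n build_binary python -c "import torch; print(torch.__version__, torch.version.cuda, torch.compiled_with_cxx11_abi())"
    # Fetch and run the standard PyTorch environment collector (the same script the job downloads next)
    wget -q https://raw.githubusercontent.com/pytorch/pytorch/main/torch/utils/collect_env.py
    conda run -n build_binary python collect_env.py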
2025-05-07T20:28:37.5136391Z + wget -q https://raw.githubusercontent.com/pytorch/pytorch/main/torch/utils/collect_env.py 2025-05-07T20:28:37.5136785Z 2025-05-07T20:28:37.6006068Z 2025-05-07T20:28:37.6006662Z [INFO] Collecting PyTorch environment info (will be needed for reporting issues to PyTorch) ... 2025-05-07T20:28:37.6030625Z [EXEC] [ATTEMPT 0/3] + conda run -n build_binary python collect_env.py 2025-05-07T20:28:43.3833056Z Collecting environment information... 2025-05-07T20:28:43.3833634Z PyTorch version: 2.8.0.dev20250507+cu126 2025-05-07T20:28:43.3834029Z Is debug build: False 2025-05-07T20:28:43.3834365Z CUDA used to build PyTorch: 12.6 2025-05-07T20:28:43.3834675Z ROCM used to build PyTorch: N/A 2025-05-07T20:28:43.3834846Z 2025-05-07T20:28:43.3834950Z OS: Amazon Linux 2023.6.20250317 (x86_64) 2025-05-07T20:28:43.3835268Z GCC version: (conda-forge gcc 11.4.0-13) 11.4.0 2025-05-07T20:28:43.3835583Z Clang version: Could not collect 2025-05-07T20:28:43.3835846Z CMake version: Could not collect 2025-05-07T20:28:43.3836113Z Libc version: glibc-2.34 2025-05-07T20:28:43.3836263Z 2025-05-07T20:28:43.3836568Z Python version: 3.13.0 | packaged by conda-forge | (main, Nov 27 2024, 19:18:50) [GCC 13.3.0] (64-bit runtime) 2025-05-07T20:28:43.3837183Z Python platform: Linux-6.1.130-139.222.amzn2023.x86_64-x86_64-with-glibc2.34 2025-05-07T20:28:43.3837585Z Is CUDA available: True 2025-05-07T20:28:43.3837838Z CUDA runtime version: 12.6.85 2025-05-07T20:28:43.3838221Z CUDA_MODULE_LOADING set to: LAZY 2025-05-07T20:28:43.3838622Z GPU models and configuration: GPU 0: NVIDIA A10G 2025-05-07T20:28:43.3839001Z Nvidia driver version: 570.133.07 2025-05-07T20:28:43.3839540Z cuDNN version: Could not collect 2025-05-07T20:28:43.3839865Z HIP runtime version: N/A 2025-05-07T20:28:43.3840205Z MIOpen runtime version: N/A 2025-05-07T20:28:43.3840633Z Is XNNPACK available: True 2025-05-07T20:28:43.3840822Z 2025-05-07T20:28:43.3840932Z CPU: 2025-05-07T20:28:43.3841261Z Architecture: x86_64 2025-05-07T20:28:43.3850231Z CPU op-mode(s): 32-bit, 64-bit 2025-05-07T20:28:43.3850631Z Address sizes: 48 bits physical, 48 bits virtual 2025-05-07T20:28:43.3851021Z Byte Order: Little Endian 2025-05-07T20:28:43.3851338Z CPU(s): 16 2025-05-07T20:28:43.3851639Z On-line CPU(s) list: 0-15 2025-05-07T20:28:43.3852267Z Vendor ID: AuthenticAMD 2025-05-07T20:28:43.3852617Z Model name: AMD EPYC 7R32 2025-05-07T20:28:43.3852938Z CPU family: 23 2025-05-07T20:28:43.3853218Z Model: 49 2025-05-07T20:28:43.3853508Z Thread(s) per core: 2 2025-05-07T20:28:43.3853946Z Core(s) per socket: 8 2025-05-07T20:28:43.3854226Z Socket(s): 1 2025-05-07T20:28:43.3854507Z Stepping: 0 2025-05-07T20:28:43.3854811Z BogoMIPS: 5599.85 2025-05-07T20:28:43.3856960Z Flags: fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush mmx fxsr sse sse2 ht syscall nx mmxext fxsr_opt pdpe1gb rdtscp lm constant_tsc rep_good nopl nonstop_tsc cpuid extd_apicid aperfmperf tsc_known_freq pni pclmulqdq ssse3 fma cx16 sse4_1 sse4_2 movbe popcnt aes xsave avx f16c rdrand hypervisor lahf_lm cmp_legacy cr8_legacy abm sse4a misalignsse 3dnowprefetch topoext ssbd ibrs ibpb stibp vmmcall fsgsbase bmi1 avx2 smep bmi2 rdseed adx smap clflushopt clwb sha_ni xsaveopt xsavec xgetbv1 clzero xsaveerptr rdpru wbnoinvd arat npt nrip_save rdpid 2025-05-07T20:28:43.3858963Z Hypervisor vendor: KVM 2025-05-07T20:28:43.3859271Z Virtualization type: full 2025-05-07T20:28:43.3859680Z L1d cache: 256 KiB (8 instances) 2025-05-07T20:28:43.3860102Z L1i cache: 256 KiB (8 instances) 
2025-05-07T20:28:43.3860473Z L2 cache: 4 MiB (8 instances) 2025-05-07T20:28:43.3860816Z L3 cache: 32 MiB (2 instances) 2025-05-07T20:28:43.3861137Z NUMA node(s): 1 2025-05-07T20:28:43.3861430Z NUMA node0 CPU(s): 0-15 2025-05-07T20:28:43.3861760Z Vulnerability Gather data sampling: Not affected 2025-05-07T20:28:43.3862313Z Vulnerability Itlb multihit: Not affected 2025-05-07T20:28:43.3862670Z Vulnerability L1tf: Not affected 2025-05-07T20:28:43.3863012Z Vulnerability Mds: Not affected 2025-05-07T20:28:43.3863362Z Vulnerability Meltdown: Not affected 2025-05-07T20:28:43.3863721Z Vulnerability Mmio stale data: Not affected 2025-05-07T20:28:43.3864082Z Vulnerability Reg file data sampling: Not affected 2025-05-07T20:28:43.3864612Z Vulnerability Retbleed: Mitigation; untrained return thunk; SMT enabled with STIBP protection 2025-05-07T20:28:43.3865183Z Vulnerability Spec rstack overflow: Mitigation; safe RET 2025-05-07T20:28:43.3865720Z Vulnerability Spec store bypass: Mitigation; Speculative Store Bypass disabled via prctl 2025-05-07T20:28:43.3866385Z Vulnerability Spectre v1: Mitigation; usercopy/swapgs barriers and __user pointer sanitization 2025-05-07T20:28:43.3867230Z Vulnerability Spectre v2: Mitigation; Retpolines; IBPB conditional; STIBP always-on; RSB filling; PBRSB-eIBRS Not affected; BHI Not affected 2025-05-07T20:28:43.3867898Z Vulnerability Srbds: Not affected 2025-05-07T20:28:43.3868262Z Vulnerability Tsx async abort: Not affected 2025-05-07T20:28:43.3868492Z 2025-05-07T20:28:43.3868597Z Versions of relevant libraries: 2025-05-07T20:28:43.3868869Z [pip3] numpy==2.2.5 2025-05-07T20:28:43.3869114Z [pip3] nvidia-cublas-cu12==12.6.4.1 2025-05-07T20:28:43.3869412Z [pip3] nvidia-cuda-cupti-cu12==12.6.80 2025-05-07T20:28:43.3869723Z [pip3] nvidia-cuda-nvrtc-cu12==12.6.77 2025-05-07T20:28:43.3870038Z [pip3] nvidia-cuda-runtime-cu12==12.6.77 2025-05-07T20:28:43.3870350Z [pip3] nvidia-cudnn-cu12==9.5.1.17 2025-05-07T20:28:43.3870629Z [pip3] nvidia-cufft-cu12==11.3.0.4 2025-05-07T20:28:43.3870915Z [pip3] nvidia-curand-cu12==10.3.7.77 2025-05-07T20:28:43.3871207Z [pip3] nvidia-cusolver-cu12==11.7.1.2 2025-05-07T20:28:43.3871503Z [pip3] nvidia-cusparse-cu12==12.5.4.2 2025-05-07T20:28:43.3871916Z [pip3] nvidia-cusparselt-cu12==0.6.3 2025-05-07T20:28:43.3872211Z [pip3] nvidia-nccl-cu12==2.26.2 2025-05-07T20:28:43.3872484Z [pip3] nvidia-nvjitlink-cu12==12.6.85 2025-05-07T20:28:43.3872780Z [pip3] nvidia-nvtx-cu12==12.6.77 2025-05-07T20:28:43.3873068Z [pip3] pytorch-triton==3.3.0+git96316ce5 2025-05-07T20:28:43.3873359Z [pip3] torch==2.8.0.dev20250507+cu126 2025-05-07T20:28:43.3873726Z [conda] cuda-cudart 12.6.77 h5888daf_0 conda-forge 2025-05-07T20:28:43.3874264Z [conda] cuda-cudart-dev 12.6.77 h5888daf_0 conda-forge 2025-05-07T20:28:43.3874835Z [conda] cuda-cudart-dev_linux-64 12.6.77 h3f2d84a_0 conda-forge 2025-05-07T20:28:43.3875340Z [conda] cuda-cudart-static 12.6.77 h5888daf_0 conda-forge 2025-05-07T20:28:43.3875865Z [conda] cuda-cudart-static_linux-64 12.6.77 h3f2d84a_0 conda-forge 2025-05-07T20:28:43.3876382Z [conda] cuda-cudart_linux-64 12.6.77 h3f2d84a_0 conda-forge 2025-05-07T20:28:43.3876862Z [conda] cuda-cupti 12.6.80 hbd13f7d_0 conda-forge 2025-05-07T20:28:43.3877322Z [conda] cuda-cupti-dev 12.6.80 h5888daf_0 conda-forge 2025-05-07T20:28:43.3877794Z [conda] cuda-libraries 12.6.3 ha770c72_0 conda-forge 2025-05-07T20:28:43.3878281Z [conda] cuda-libraries-dev 12.6.3 ha770c72_0 conda-forge 2025-05-07T20:28:43.3878746Z [conda] cuda-nvrtc 12.6.85 hbd13f7d_0 conda-forge 
2025-05-07T20:28:43.3879202Z [conda] cuda-nvrtc-dev 12.6.85 h5888daf_0 conda-forge 2025-05-07T20:28:43.3879657Z [conda] cuda-nvtx 12.6.77 hbd13f7d_0 conda-forge 2025-05-07T20:28:43.3880109Z [conda] cuda-opencl 12.6.77 hbd13f7d_0 conda-forge 2025-05-07T20:28:43.3880570Z [conda] cuda-opencl-dev 12.6.77 h5888daf_0 conda-forge 2025-05-07T20:28:43.3881142Z [conda] cuda-runtime 12.6.3 ha804496_0 conda-forge 2025-05-07T20:28:43.3881598Z [conda] libcublas 12.6.4.1 h5888daf_1 conda-forge 2025-05-07T20:28:43.3882051Z [conda] libcublas-dev 12.6.4.1 h5888daf_1 conda-forge 2025-05-07T20:28:43.3882511Z [conda] libcufft 11.3.0.4 hbd13f7d_0 conda-forge 2025-05-07T20:28:43.3882966Z [conda] libcufft-dev 11.3.0.4 h5888daf_0 conda-forge 2025-05-07T20:28:43.3883423Z [conda] libcurand 10.3.7.77 hbd13f7d_0 conda-forge 2025-05-07T20:28:43.3883875Z [conda] libcurand-dev 10.3.7.77 h5888daf_0 conda-forge 2025-05-07T20:28:43.3884340Z [conda] libcusolver 11.7.1.2 h5888daf_1 conda-forge 2025-05-07T20:28:43.3884808Z [conda] libcusolver-dev 11.7.1.2 h5888daf_1 conda-forge 2025-05-07T20:28:43.3885313Z [conda] libcusparse 12.5.4.2 hbd13f7d_0 conda-forge 2025-05-07T20:28:43.3885834Z [conda] libcusparse-dev 12.5.4.2 h5888daf_0 conda-forge 2025-05-07T20:28:43.3886315Z [conda] libnvjitlink 12.6.85 hbd13f7d_0 conda-forge 2025-05-07T20:28:43.3886804Z [conda] libnvjitlink-dev 12.6.85 h5888daf_0 conda-forge 2025-05-07T20:28:43.3887262Z [conda] numpy 2.2.5 py313h17eae1a_0 conda-forge 2025-05-07T20:28:43.3887720Z [conda] nvidia-cublas-cu12 12.6.4.1 pypi_0 pypi 2025-05-07T20:28:43.3888210Z [conda] nvidia-cuda-cupti-cu12 12.6.80 pypi_0 pypi 2025-05-07T20:28:43.3888695Z [conda] nvidia-cuda-nvrtc-cu12 12.6.77 pypi_0 pypi 2025-05-07T20:28:43.3889206Z [conda] nvidia-cuda-runtime-cu12 12.6.77 pypi_0 pypi 2025-05-07T20:28:43.3889770Z [conda] nvidia-cudnn-cu12 9.5.1.17 pypi_0 pypi 2025-05-07T20:28:43.3890351Z [conda] nvidia-cufft-cu12 11.3.0.4 pypi_0 pypi 2025-05-07T20:28:43.3890825Z [conda] nvidia-curand-cu12 10.3.7.77 pypi_0 pypi 2025-05-07T20:28:43.3891306Z [conda] nvidia-cusolver-cu12 11.7.1.2 pypi_0 pypi 2025-05-07T20:28:43.3891796Z [conda] nvidia-cusparse-cu12 12.5.4.2 pypi_0 pypi 2025-05-07T20:28:43.3892277Z [conda] nvidia-cusparselt-cu12 0.6.3 pypi_0 pypi 2025-05-07T20:28:43.3892758Z [conda] nvidia-nccl-cu12 2.26.2 pypi_0 pypi 2025-05-07T20:28:43.3893233Z [conda] nvidia-nvjitlink-cu12 12.6.85 pypi_0 pypi 2025-05-07T20:28:43.3893813Z [conda] nvidia-nvtx-cu12 12.6.77 pypi_0 pypi 2025-05-07T20:28:43.3894284Z [conda] pytorch-triton 3.3.0+git96316ce5 pypi_0 pypi 2025-05-07T20:28:43.3894742Z [conda] torch 2.8.0.dev20250507+cu126 pypi_0 pypi 2025-05-07T20:28:43.3895011Z 2025-05-07T20:28:43.4569722Z ##[group]Run . $PRELUDE; cd fbgemm_gpu; prepare_fbgemm_gpu_build $BUILD_ENV 2025-05-07T20:28:43.4570419Z . 
$PRELUDE; cd fbgemm_gpu; prepare_fbgemm_gpu_build $BUILD_ENV 2025-05-07T20:28:43.4582343Z shell: /usr/bin/bash --noprofile --norc -e -o pipefail {0} 2025-05-07T20:28:43.4582679Z env: 2025-05-07T20:28:43.4582905Z PRELUDE: .github/scripts/setup_env.bash 2025-05-07T20:28:43.4583198Z BUILD_ENV: build_binary 2025-05-07T20:28:43.4583433Z BUILD_TARGET: genai 2025-05-07T20:28:43.4583658Z BUILD_VARIANT: cuda 2025-05-07T20:28:43.4583892Z BUILD_CUDA_VERSION: 12.6.3 2025-05-07T20:28:43.4584144Z ENFORCE_CUDA_DEVICE: 1 2025-05-07T20:28:43.4584434Z GPU_FLAG: --gpus all -e NVIDIA_DRIVER_CAPABILITIES=all 2025-05-07T20:28:43.4584763Z ##[endgroup] 2025-05-07T20:28:43.7971342Z ################################################################################ 2025-05-07T20:28:43.7972057Z # Prepare FBGEMM-GPU Build 2025-05-07T20:28:43.7972303Z # 2025-05-07T20:28:43.7987934Z # [2025-05-07T20:28:43.798Z] + prepare_fbgemm_gpu_build build_binary 2025-05-07T20:28:43.7988343Z ################################################################################ 2025-05-07T20:28:43.7988557Z 2025-05-07T20:28:43.8003671Z [EXEC] [ATTEMPT 0/3] + wget -q --timeout 1 pypi.org -O /dev/null 2025-05-07T20:28:43.8909496Z [CHECK] Network does not appear to be blocked. 2025-05-07T20:28:43.8930594Z [BUILD] Running git submodules update ... 2025-05-07T20:28:43.8953461Z [EXEC] [ATTEMPT 0/3] + git submodule sync 2025-05-07T20:28:43.9318211Z Synchronizing submodule url for '../external/asmjit' 2025-05-07T20:28:43.9318685Z Synchronizing submodule url for '../external/composable_kernel' 2025-05-07T20:28:43.9319116Z Synchronizing submodule url for '../external/cpuinfo' 2025-05-07T20:28:43.9319498Z Synchronizing submodule url for '../external/cutlass' 2025-05-07T20:28:43.9319895Z Synchronizing submodule url for '../external/googletest' 2025-05-07T20:28:43.9320390Z Synchronizing submodule url for '../external/hipify_torch' 2025-05-07T20:28:43.9320794Z Synchronizing submodule url for '../external/json' 2025-05-07T20:28:43.9353736Z [EXEC] [ATTEMPT 0/3] + git submodule update --init --recursive 2025-05-07T20:28:43.9901440Z [BUILD] Installing other build dependencies ... 
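[NOTE] The build prep around this point is two steps: the submodule refresh that just ran, and the dependency install that follows. A standalone sketch, run from the repository root with the build_binary environment active:

    # Step 1: re-sync submodule URLs from .gitmodules, then check out the pinned commits (asmjit, cutlass, etc.)
    git submodule sync
    git submodule update --init --recursive
    # Step 2: install the build toolchain and test deps (cmake, ninja, scikit-build, hypothesis, ...)
    cd fbgemm_gpu
    python -m pip install -r requirements.txt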
2025-05-07T20:28:43.9922550Z [EXEC] [ATTEMPT 0/3] + conda run --no-capture-output -n build_binary python -m pip install -r requirements.txt 2025-05-07T20:28:46.4527270Z Collecting backports.tarfile (from -r requirements.txt (line 13)) 2025-05-07T20:28:46.4538552Z Using cached backports.tarfile-1.2.0-py3-none-any.whl.metadata (2.0 kB) 2025-05-07T20:28:46.4887231Z Collecting build (from -r requirements.txt (line 14)) 2025-05-07T20:28:46.4896861Z Using cached build-1.2.2.post1-py3-none-any.whl.metadata (6.5 kB) 2025-05-07T20:28:46.6189115Z Collecting cmake (from -r requirements.txt (line 15)) 2025-05-07T20:28:46.6200762Z Using cached cmake-4.0.0-py3-none-manylinux_2_17_x86_64.manylinux2014_x86_64.whl.metadata (6.3 kB) 2025-05-07T20:28:46.6530646Z Collecting click (from -r requirements.txt (line 16)) 2025-05-07T20:28:46.6539896Z Using cached click-8.1.8-py3-none-any.whl.metadata (2.3 kB) 2025-05-07T20:28:46.8738306Z Collecting hypothesis (from -r requirements.txt (line 17)) 2025-05-07T20:28:46.8748874Z Using cached hypothesis-6.131.14-py3-none-any.whl.metadata (5.6 kB) 2025-05-07T20:28:46.8834980Z Requirement already satisfied: jinja2 in /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages (from -r requirements.txt (line 18)) (3.1.4) 2025-05-07T20:28:46.8838008Z Requirement already satisfied: mpmath==1.3.0 in /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages (from -r requirements.txt (line 19)) (1.3.0) 2025-05-07T20:28:46.9280494Z Collecting ninja (from -r requirements.txt (line 20)) 2025-05-07T20:28:46.9290176Z Using cached ninja-1.11.1.4-py3-none-manylinux_2_12_x86_64.manylinux2010_x86_64.whl.metadata (5.0 kB) 2025-05-07T20:28:46.9304069Z Requirement already satisfied: numpy>=2.0.2 in /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages (from -r requirements.txt (line 21)) (2.2.5) 2025-05-07T20:28:46.9649246Z Collecting pyre-extensions (from -r requirements.txt (line 22)) 2025-05-07T20:28:46.9657882Z Using cached pyre_extensions-0.0.32-py3-none-any.whl.metadata (4.0 kB) 2025-05-07T20:28:47.0108646Z Collecting pyyaml (from -r requirements.txt (line 23)) 2025-05-07T20:28:47.0291329Z Downloading PyYAML-6.0.2-cp313-cp313-manylinux_2_17_x86_64.manylinux2014_x86_64.whl.metadata (2.1 kB) 2025-05-07T20:28:47.1105450Z Collecting scikit-build (from -r requirements.txt (line 24)) 2025-05-07T20:28:47.1114324Z Using cached scikit_build-0.18.1-py3-none-any.whl.metadata (18 kB) 2025-05-07T20:28:47.1164207Z Requirement already satisfied: setuptools in /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages (from -r requirements.txt (line 25)) (78.1.1) 2025-05-07T20:28:47.1577045Z Collecting setuptools_git_versioning (from -r requirements.txt (line 26)) 2025-05-07T20:28:47.1585663Z Using cached setuptools_git_versioning-2.1.0-py3-none-any.whl.metadata (6.1 kB) 2025-05-07T20:28:47.1970257Z Collecting tabulate (from -r requirements.txt (line 27)) 2025-05-07T20:28:47.1979421Z Using cached tabulate-0.9.0-py3-none-any.whl.metadata (34 kB) 2025-05-07T20:28:47.2302386Z Collecting patchelf (from -r requirements.txt (line 28)) 2025-05-07T20:28:47.2312059Z Using cached patchelf-0.17.2.2-py3-none-manylinux1_x86_64.manylinux_2_5_x86_64.musllinux_1_1_x86_64.whl.metadata (3.5 kB) 2025-05-07T20:28:47.2719299Z Collecting packaging>=19.1 (from build->-r requirements.txt (line 14)) 2025-05-07T20:28:47.2728293Z Using cached packaging-25.0-py3-none-any.whl.metadata (3.3 kB) 2025-05-07T20:28:47.3089325Z Collecting pyproject_hooks (from build->-r 
requirements.txt (line 14)) 2025-05-07T20:28:47.3098930Z Using cached pyproject_hooks-1.2.0-py3-none-any.whl.metadata (1.3 kB) 2025-05-07T20:28:47.3457851Z Collecting attrs>=22.2.0 (from hypothesis->-r requirements.txt (line 17)) 2025-05-07T20:28:47.3466813Z Using cached attrs-25.3.0-py3-none-any.whl.metadata (10 kB) 2025-05-07T20:28:47.3876810Z Collecting sortedcontainers<3.0.0,>=2.1.0 (from hypothesis->-r requirements.txt (line 17)) 2025-05-07T20:28:47.3885700Z Using cached sortedcontainers-2.4.0-py2.py3-none-any.whl.metadata (10 kB) 2025-05-07T20:28:47.3910907Z Requirement already satisfied: MarkupSafe>=2.0 in /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages (from jinja2->-r requirements.txt (line 18)) (2.1.5) 2025-05-07T20:28:47.4252764Z Collecting typing-inspect (from pyre-extensions->-r requirements.txt (line 22)) 2025-05-07T20:28:47.4261654Z Using cached typing_inspect-0.9.0-py3-none-any.whl.metadata (1.5 kB) 2025-05-07T20:28:47.4275394Z Requirement already satisfied: typing-extensions in /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages (from pyre-extensions->-r requirements.txt (line 22)) (4.13.2) 2025-05-07T20:28:47.4531728Z Collecting distro (from scikit-build->-r requirements.txt (line 24)) 2025-05-07T20:28:47.4540814Z Using cached distro-1.9.0-py3-none-any.whl.metadata (6.8 kB) 2025-05-07T20:28:47.4560678Z Requirement already satisfied: wheel>=0.32.0 in /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages (from scikit-build->-r requirements.txt (line 24)) (0.45.1) 2025-05-07T20:28:47.5030219Z Collecting mypy-extensions>=0.3.0 (from typing-inspect->pyre-extensions->-r requirements.txt (line 22)) 2025-05-07T20:28:47.5039084Z Using cached mypy_extensions-1.1.0-py3-none-any.whl.metadata (1.1 kB) 2025-05-07T20:28:47.5067444Z Using cached backports.tarfile-1.2.0-py3-none-any.whl (30 kB) 2025-05-07T20:28:47.5076448Z Using cached build-1.2.2.post1-py3-none-any.whl (22 kB) 2025-05-07T20:28:47.5085759Z Using cached cmake-4.0.0-py3-none-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (27.9 MB) 2025-05-07T20:28:47.5282295Z Using cached click-8.1.8-py3-none-any.whl (98 kB) 2025-05-07T20:28:47.5291665Z Using cached hypothesis-6.131.14-py3-none-any.whl (500 kB) 2025-05-07T20:28:47.5304668Z Using cached sortedcontainers-2.4.0-py2.py3-none-any.whl (29 kB) 2025-05-07T20:28:47.5313718Z Using cached ninja-1.11.1.4-py3-none-manylinux_2_12_x86_64.manylinux2010_x86_64.whl (422 kB) 2025-05-07T20:28:47.5325346Z Using cached pyre_extensions-0.0.32-py3-none-any.whl (12 kB) 2025-05-07T20:28:47.5356012Z Downloading PyYAML-6.0.2-cp313-cp313-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (759 kB) 2025-05-07T20:28:47.6186569Z ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 759.5/759.5 kB 6.9 MB/s eta 0:00:00 2025-05-07T20:28:47.6195046Z Using cached scikit_build-0.18.1-py3-none-any.whl (85 kB) 2025-05-07T20:28:47.6205020Z Using cached setuptools_git_versioning-2.1.0-py3-none-any.whl (10 kB) 2025-05-07T20:28:47.6214002Z Using cached tabulate-0.9.0-py3-none-any.whl (35 kB) 2025-05-07T20:28:47.6223556Z Using cached patchelf-0.17.2.2-py3-none-manylinux1_x86_64.manylinux_2_5_x86_64.musllinux_1_1_x86_64.whl (466 kB) 2025-05-07T20:28:47.6235161Z Using cached attrs-25.3.0-py3-none-any.whl (63 kB) 2025-05-07T20:28:47.6244518Z Using cached packaging-25.0-py3-none-any.whl (66 kB) 2025-05-07T20:28:47.6253808Z Using cached distro-1.9.0-py3-none-any.whl (20 kB) 2025-05-07T20:28:47.6262920Z Using cached pyproject_hooks-1.2.0-py3-none-any.whl (10 kB) 
2025-05-07T20:28:47.6271673Z Using cached typing_inspect-0.9.0-py3-none-any.whl (8.8 kB) 2025-05-07T20:28:47.6280569Z Using cached mypy_extensions-1.1.0-py3-none-any.whl (5.0 kB) 2025-05-07T20:28:47.7510169Z Installing collected packages: sortedcontainers, tabulate, pyyaml, pyproject_hooks, patchelf, packaging, ninja, mypy-extensions, distro, cmake, click, backports.tarfile, attrs, typing-inspect, setuptools_git_versioning, scikit-build, hypothesis, build, pyre-extensions 2025-05-07T20:28:50.1821283Z 2025-05-07T20:28:50.1874603Z Successfully installed attrs-25.3.0 backports.tarfile-1.2.0 build-1.2.2.post1 click-8.1.8 cmake-4.0.0 distro-1.9.0 hypothesis-6.131.14 mypy-extensions-1.1.0 ninja-1.11.1.4 packaging-25.0 patchelf-0.17.2.2 pyproject_hooks-1.2.0 pyre-extensions-0.0.32 pyyaml-6.0.2 scikit-build-0.18.1 setuptools_git_versioning-2.1.0 sortedcontainers-2.4.0 tabulate-0.9.0 typing-inspect-0.9.0 2025-05-07T20:28:50.3574918Z ################################################################################ 2025-05-07T20:28:50.3575371Z # Install PyTorch (PyTorch PIP) 2025-05-07T20:28:50.3575641Z # 2025-05-07T20:28:50.3592114Z # [2025-05-07T20:28:50.358Z] + install_triton_pip build_binary 2025-05-07T20:28:50.3592566Z ################################################################################ 2025-05-07T20:28:50.3592787Z 2025-05-07T20:28:50.3593007Z [BUILD] Installing pytorch-triton nightly/3.2.0+git4b3bb1f8 from PIP ... 2025-05-07T20:28:50.3593438Z ################################################################################ 2025-05-07T20:28:50.3593791Z # Install Package From PyTorch PIP: pytorch-triton 2025-05-07T20:28:50.3594096Z # 2025-05-07T20:28:50.3609123Z # [2025-05-07T20:28:50.360Z] + install_from_pytorch_pip build_binary pytorch-triton nightly/3.2.0+git4b3bb1f8 2025-05-07T20:28:50.3609746Z ################################################################################ 2025-05-07T20:28:50.3609959Z 2025-05-07T20:28:50.3626580Z [EXEC] [ATTEMPT 0/3] + wget -q --timeout 1 pypi.org -O /dev/null 2025-05-07T20:28:50.4562642Z [CHECK] Network does not appear to be blocked. 2025-05-07T20:28:50.4563013Z ################################################################################ 2025-05-07T20:28:50.4563359Z # Prepare PIP Arguments (PyTorch PIP) 2025-05-07T20:28:50.4563639Z # 2025-05-07T20:28:50.4580696Z # [2025-05-07T20:28:50.457Z] + __prepare_pip_arguments pytorch-triton nightly/3.2.0+git4b3bb1f8 2025-05-07T20:28:50.4581188Z ################################################################################ 2025-05-07T20:28:50.4581410Z 2025-05-07T20:28:50.4629656Z [INSTALL] Extracted package (channel, version): (nightly, 3.2.0+git4b3bb1f8) 2025-05-07T20:28:50.4645994Z [INSTALL] Using a non-RELEASE channel: nightly ... 2025-05-07T20:28:50.4646746Z [INSTALL] Extracted the full PIP channel: https://download.pytorch.org/whl/nightly/ 2025-05-07T20:28:50.4655154Z [INSTALL] Extracted the full PIP package: --pre pytorch-triton==3.2.0+git4b3bb1f8 2025-05-07T20:28:50.4664433Z [INSTALL] Attempting to install [pytorch-triton, 3.2.0+git4b3bb1f8] from PyTorch PIP using channel https://download.pytorch.org/whl/nightly/ ... 2025-05-07T20:28:50.4685182Z [EXEC] [ATTEMPT 0/3] + conda run -n build_binary pip install --pre pytorch-triton==3.2.0+git4b3bb1f8 --index-url https://download.pytorch.org/whl/nightly/ 2025-05-07T20:28:57.1005683Z ERROR: pip's dependency resolver does not currently take into account all the packages that are installed. 
This behaviour is the source of the following dependency conflicts. 2025-05-07T20:28:57.1007787Z torch 2.8.0.dev20250507+cu126 requires pytorch-triton==3.3.0+git96316ce5; platform_system == "Linux" and platform_machine == "x86_64", but you have pytorch-triton 3.2.0+git4b3bb1f8 which is incompatible. 2025-05-07T20:28:57.1009407Z 2025-05-07T20:28:57.1009737Z Looking in indexes: https://download.pytorch.org/whl/nightly/ 2025-05-07T20:28:57.1010323Z Collecting pytorch-triton==3.2.0+git4b3bb1f8 2025-05-07T20:28:57.1011487Z Downloading https://download.pytorch.org/whl/nightly/pytorch_triton-3.2.0%2Bgit4b3bb1f8-cp313-cp313-manylinux_2_27_x86_64.manylinux_2_28_x86_64.whl.metadata (1.3 kB) 2025-05-07T20:28:57.1013334Z Downloading https://download.pytorch.org/whl/nightly/pytorch_triton-3.2.0%2Bgit4b3bb1f8-cp313-cp313-manylinux_2_27_x86_64.manylinux_2_28_x86_64.whl (166.5 MB) 2025-05-07T20:28:57.1015066Z ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 166.5/166.5 MB 86.4 MB/s eta 0:00:00 2025-05-07T20:28:57.1015591Z Installing collected packages: pytorch-triton 2025-05-07T20:28:57.1016088Z Attempting uninstall: pytorch-triton 2025-05-07T20:28:57.1016631Z Found existing installation: pytorch-triton 3.3.0+git96316ce5 2025-05-07T20:28:57.1017233Z Uninstalling pytorch-triton-3.3.0+git96316ce5: 2025-05-07T20:28:57.1017889Z Successfully uninstalled pytorch-triton-3.3.0+git96316ce5 2025-05-07T20:28:57.1018528Z Successfully installed pytorch-triton-3.2.0+git4b3bb1f8 2025-05-07T20:28:57.1018908Z 2025-05-07T20:28:59.3452695Z [CHECK] Python (sub-)package 'triton' found ... 2025-05-07T20:28:59.3455992Z [CHECK] Printing out the pytorch-triton version ... 2025-05-07T20:29:01.5091022Z ################################################################################ 2025-05-07T20:29:01.5091479Z [CHECK] The installed VERSION of pytorch-triton is: 3.2.0 2025-05-07T20:29:01.5091850Z ################################################################################ 2025-05-07T20:29:01.5092066Z 2025-05-07T20:29:03.5751684Z [CHECK] Python (sub-)package 'numpy' found ... 2025-05-07T20:29:05.7684798Z [CHECK] Python (sub-)package 'skbuild' found ... 2025-05-07T20:29:05.7688735Z [BUILD] Successfully ran git submodules update 2025-05-07T20:29:05.7742054Z ##[group]Run . $PRELUDE; install_fbgemm_gpu_wheel $BUILD_ENV *.whl 2025-05-07T20:29:05.7742539Z . 
$PRELUDE; install_fbgemm_gpu_wheel $BUILD_ENV *.whl 2025-05-07T20:29:05.7754214Z shell: /usr/bin/bash --noprofile --norc -e -o pipefail {0} 2025-05-07T20:29:05.7754549Z env: 2025-05-07T20:29:05.7754769Z PRELUDE: .github/scripts/setup_env.bash 2025-05-07T20:29:05.7755055Z BUILD_ENV: build_binary 2025-05-07T20:29:05.7755293Z BUILD_TARGET: genai 2025-05-07T20:29:05.7755524Z BUILD_VARIANT: cuda 2025-05-07T20:29:05.7755749Z BUILD_CUDA_VERSION: 12.6.3 2025-05-07T20:29:05.7756000Z ENFORCE_CUDA_DEVICE: 1 2025-05-07T20:29:05.7756296Z GPU_FLAG: --gpus all -e NVIDIA_DRIVER_CAPABILITIES=all 2025-05-07T20:29:05.7756612Z ##[endgroup] 2025-05-07T20:29:06.1133318Z ################################################################################ 2025-05-07T20:29:06.1133831Z # Install FBGEMM-GPU from Wheel 2025-05-07T20:29:06.1134086Z # 2025-05-07T20:29:06.1150618Z # [2025-05-07T20:29:06.114Z] + install_fbgemm_gpu_wheel build_binary fbgemm_gpu_genai_nightly-2025.5.7-cp313-cp313-manylinux_2_28_x86_64.whl 2025-05-07T20:29:06.1151255Z ################################################################################ 2025-05-07T20:29:06.1151466Z 2025-05-07T20:29:06.1151818Z [INSTALL] Printing out FBGEMM-GPU wheel SHA: fbgemm_gpu_genai_nightly-2025.5.7-cp313-cp313-manylinux_2_28_x86_64.whl 2025-05-07T20:29:06.1152493Z + sha1sum fbgemm_gpu_genai_nightly-2025.5.7-cp313-cp313-manylinux_2_28_x86_64.whl 2025-05-07T20:29:06.1152819Z 2025-05-07T20:29:06.1259376Z f90095cdf9a3f2a3bbac1aa51f6d03c22b933a7e fbgemm_gpu_genai_nightly-2025.5.7-cp313-cp313-manylinux_2_28_x86_64.whl 2025-05-07T20:29:06.1261908Z 2025-05-07T20:29:06.1262482Z + sha256sum fbgemm_gpu_genai_nightly-2025.5.7-cp313-cp313-manylinux_2_28_x86_64.whl 2025-05-07T20:29:06.1262819Z 2025-05-07T20:29:06.1387626Z ed17ebd3a2864d614d536415eaaeb2b336bf2d88ef5df95627044ab7b9ab7adc fbgemm_gpu_genai_nightly-2025.5.7-cp313-cp313-manylinux_2_28_x86_64.whl 2025-05-07T20:29:06.1389959Z 2025-05-07T20:29:06.1390371Z + md5sum fbgemm_gpu_genai_nightly-2025.5.7-cp313-cp313-manylinux_2_28_x86_64.whl 2025-05-07T20:29:06.1391050Z 2025-05-07T20:29:06.1624594Z 7ea0c844ddb54583dafff944ccac7bb0 fbgemm_gpu_genai_nightly-2025.5.7-cp313-cp313-manylinux_2_28_x86_64.whl 2025-05-07T20:29:06.1627052Z 2025-05-07T20:29:06.1636266Z [INSTALL] Installing FBGEMM-GPU wheel: fbgemm_gpu_genai_nightly-2025.5.7-cp313-cp313-manylinux_2_28_x86_64.whl ... 2025-05-07T20:29:06.1657896Z [EXEC] [ATTEMPT 0/3] + conda run -n build_binary python -m pip install fbgemm_gpu_genai_nightly-2025.5.7-cp313-cp313-manylinux_2_28_x86_64.whl 2025-05-07T20:29:08.8678618Z Processing ./fbgemm_gpu_genai_nightly-2025.5.7-cp313-cp313-manylinux_2_28_x86_64.whl 2025-05-07T20:29:08.8679561Z Requirement already satisfied: numpy in /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages (from fbgemm-gpu-genai-nightly==2025.5.7) (2.2.5) 2025-05-07T20:29:08.8680398Z Installing collected packages: fbgemm-gpu-genai-nightly 2025-05-07T20:29:08.8680829Z Successfully installed fbgemm-gpu-genai-nightly-2025.5.7 2025-05-07T20:29:08.8681108Z 2025-05-07T20:29:15.6835489Z ################################################################################ 2025-05-07T20:29:15.6836206Z [CHECK] !!!! INFO !!!! 
2025-05-07T20:29:15.6836924Z [CHECK] The installed version of PyTorch is: 2.8.0.dev20250507+cu126 2025-05-07T20:29:15.6837742Z [CHECK] CUDA version reported by PyTorch is: 12.6 2025-05-07T20:29:15.6838341Z [CHECK] 2025-05-07T20:29:15.6838881Z [CHECK] NOTE: If the PyTorch package channel is different from the FBGEMM_GPU 2025-05-07T20:29:15.6839418Z [CHECK] package channel; the package may be broken at runtime!!! 2025-05-07T20:29:15.6839803Z ################################################################################ 2025-05-07T20:29:15.6840013Z 2025-05-07T20:29:15.6840130Z [INSTALL] Checking imports and symbols ... 2025-05-07T20:29:19.6188219Z [CHECK] Python (sub-)package 'fbgemm_gpu' found ... 2025-05-07T20:29:23.5453450Z [CHECK] Found symbol '__version__' in Python package 'fbgemm_gpu'. 2025-05-07T20:29:27.4764995Z [CHECK] Found symbol '__variant__' in Python package 'fbgemm_gpu'. 2025-05-07T20:29:27.4768013Z [CHECK] Printing out the FBGEMM-GPU version ... 2025-05-07T20:29:39.2341695Z ################################################################################ 2025-05-07T20:29:39.2342102Z [CHECK] The installed FBGEMM TARGET is: genai 2025-05-07T20:29:39.2342445Z [CHECK] The installed FBGEMM VARIANT is: cuda 2025-05-07T20:29:39.2342784Z [CHECK] The installed FBGEMM VERSION is: 2025.5.7 2025-05-07T20:29:39.2343121Z ################################################################################ 2025-05-07T20:29:39.2343331Z 2025-05-07T20:29:47.0965736Z ################################################################################ 2025-05-07T20:29:47.0966491Z [CHECK] FBGEMM_GPU Experimental Packages 2025-05-07T20:29:47.0968438Z [CHECK] fbgemm_gpu: ['__annotations__', '__builtins__', '__cached__', '__doc__', '__file__', '__loader__', '__name__', '__package__', '__path__', '__spec__', '__target__', '__variant__', '__version__', '_load_library', 'docs', 'fbgemm_genai_libraries', 'fbgemm_gpu', 'fbgemm_gpu_libraries', 'libraries_to_load', 'library', 'logging', 'open_source', 'os', 'split_embedding_configs', 'split_table_batched_embeddings_ops_common', 'torch', 'utils'] 2025-05-07T20:29:47.0969979Z [CHECK] fbgemm_gpu.experimental: ['__doc__', '__file__', '__loader__', '__name__', '__package__', '__path__', '__spec__'] 2025-05-07T20:29:47.0970496Z ################################################################################ 2025-05-07T20:29:47.0970708Z 2025-05-07T20:29:47.0970867Z [INSTALL] Check for installation of Python sources ... 2025-05-07T20:29:51.0318244Z [CHECK] Python (sub-)package 'fbgemm_gpu.config' found ... 2025-05-07T20:29:54.9641601Z [CHECK] Python (sub-)package 'fbgemm_gpu.docs' found ... 2025-05-07T20:29:59.0086292Z [CHECK] Python (sub-)package 'fbgemm_gpu.quantize' found ... 2025-05-07T20:30:02.9434601Z [CHECK] Python (sub-)package 'fbgemm_gpu.tbe.cache' found ... 2025-05-07T20:30:02.9442090Z [INSTALL] Check for operator registrations ... 
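[NOTE] Each registration check below amounts to resolving the operator on the torch.ops namespace, which fails if the shared libraries bundled in the wheel did not load or register it. A minimal standalone probe (a sketch; importing fbgemm_gpu first triggers the library load):

    conda run -n build_binary python -c "import torch, fbgemm_gpu; print(torch.ops.fbgemm.nccl_init)"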
2025-05-07T20:30:06.7843321Z fbgemm.nccl_init 2025-05-07T20:30:06.7843516Z 2025-05-07T20:30:06.8478776Z [CHECK] FBGEMM_GPU operator appears to be correctly registered: torch.ops.fbgemm.nccl_init 2025-05-07T20:30:10.6902297Z fbgemm.gqa_attn_splitk 2025-05-07T20:30:10.6902512Z 2025-05-07T20:30:10.7537497Z [CHECK] FBGEMM_GPU operator appears to be correctly registered: torch.ops.fbgemm.gqa_attn_splitk 2025-05-07T20:30:14.5971326Z fbgemm.rope_qkv_decoding 2025-05-07T20:30:14.5971554Z 2025-05-07T20:30:14.6601436Z [CHECK] FBGEMM_GPU operator appears to be correctly registered: torch.ops.fbgemm.rope_qkv_decoding 2025-05-07T20:30:14.6602034Z [INSTALL] FBGEMM-GPU installation through wheel completed ... 2025-05-07T20:30:14.6638708Z ##[group]Run . $PRELUDE; test_all_fbgemm_gpu_modules $BUILD_ENV 2025-05-07T20:30:14.6639162Z . $PRELUDE; test_all_fbgemm_gpu_modules $BUILD_ENV 2025-05-07T20:30:14.6651879Z shell: /usr/bin/bash --noprofile --norc -e -o pipefail {0} 2025-05-07T20:30:14.6652238Z env: 2025-05-07T20:30:14.6652465Z PRELUDE: .github/scripts/setup_env.bash 2025-05-07T20:30:14.6652762Z BUILD_ENV: build_binary 2025-05-07T20:30:14.6653008Z BUILD_TARGET: genai 2025-05-07T20:30:14.6653239Z BUILD_VARIANT: cuda 2025-05-07T20:30:14.6653469Z BUILD_CUDA_VERSION: 12.6.3 2025-05-07T20:30:14.6653923Z ENFORCE_CUDA_DEVICE: 1 2025-05-07T20:30:14.6654228Z GPU_FLAG: --gpus all -e NVIDIA_DRIVER_CAPABILITIES=all 2025-05-07T20:30:14.6654554Z ##[endgroup] 2025-05-07T20:30:15.0044183Z ################################################################################ 2025-05-07T20:30:15.0044550Z # Test All FBGEMM-GPU Modules 2025-05-07T20:30:15.0044813Z # 2025-05-07T20:30:15.0062152Z # [2025-05-07T20:30:15.005Z] + test_all_fbgemm_gpu_modules build_binary 2025-05-07T20:30:15.0062563Z ################################################################################ 2025-05-07T20:30:15.0062780Z 2025-05-07T20:30:22.8459731Z [TEST] Determined FBGEMM_GPU (target : variant) from installation: (genai : cuda) 2025-05-07T20:30:22.8460351Z [TEST] Will be running tests specific to this target and variant ... 2025-05-07T20:30:22.8460751Z [TEST] Determined the test directories: 2025-05-07T20:30:22.8461062Z fbgemm_gpu/experimental/gen_ai/test 2025-05-07T20:30:22.8461352Z fbgemm_gpu/experimental/example/test 2025-05-07T20:30:22.8461652Z fbgemm_gpu/experimental/gemm/test 2025-05-07T20:30:22.8461835Z 2025-05-07T20:30:22.8466532Z [TEST] FBGEMM_GPU variant is cuda; configuring for CUDA-based testing ... 2025-05-07T20:30:22.8473301Z [TEST] Set environment variables for CUDA testing ... 2025-05-07T20:30:22.8473872Z + conda env config vars unset -n build_binary CUDA_VISIBLE_DEVICES 2025-05-07T20:30:22.8474261Z 2025-05-07T20:30:23.2753260Z 2025-05-07T20:30:23.2754053Z [TEST] Installing PyTest ... 
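[NOTE] pytest and expecttest are installed from conda-forge below (hypothesis, the third test dependency, already arrived via requirements.txt). A quick post-install sanity check (a sketch) is:

    conda run -n build_binary python -m pytest --version
    conda run -n build_binary python -c "import expecttest, hypothesis; print(expecttest.__name__, hypothesis.__version__)"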
2025-05-07T20:30:23.2776369Z [EXEC] [ATTEMPT 0/3] + conda install -n build_binary -c conda-forge --override-channels -y pytest expecttest 2025-05-07T20:30:24.5572341Z Channels: 2025-05-07T20:30:24.5572599Z - conda-forge 2025-05-07T20:30:24.5572824Z Platform: linux-64 2025-05-07T20:30:27.8748624Z Collecting package metadata (repodata.json): - \ | / done 2025-05-07T20:30:29.0223416Z Solving environment: \ | / done 2025-05-07T20:30:29.2508881Z 2025-05-07T20:30:29.2509397Z ## Package Plan ## 2025-05-07T20:30:29.2509577Z 2025-05-07T20:30:29.2509790Z environment location: /home/ec2-user/miniconda/envs/build_binary 2025-05-07T20:30:29.2510085Z 2025-05-07T20:30:29.2510183Z added / updated specs: 2025-05-07T20:30:29.2510438Z - expecttest 2025-05-07T20:30:29.2510658Z - pytest 2025-05-07T20:30:29.2510777Z 2025-05-07T20:30:29.2510781Z 2025-05-07T20:30:29.2510904Z The following packages will be downloaded: 2025-05-07T20:30:29.2511134Z 2025-05-07T20:30:29.2511251Z package | build 2025-05-07T20:30:29.2511568Z ---------------------------|----------------- 2025-05-07T20:30:29.2511938Z colorama-0.4.6 | pyhd8ed1ab_1 26 KB conda-forge 2025-05-07T20:30:29.2513056Z exceptiongroup-1.2.2 | pyhd8ed1ab_1 20 KB conda-forge 2025-05-07T20:30:29.2513658Z expecttest-0.3.0 | pyhd8ed1ab_0 14 KB conda-forge 2025-05-07T20:30:29.2514220Z iniconfig-2.0.0 | pyhd8ed1ab_1 11 KB conda-forge 2025-05-07T20:30:29.2514656Z packaging-25.0 | pyh29332c3_1 61 KB conda-forge 2025-05-07T20:30:29.2515058Z pluggy-1.5.0 | pyhd8ed1ab_1 23 KB conda-forge 2025-05-07T20:30:29.2515457Z pytest-8.3.5 | pyhd8ed1ab_0 254 KB conda-forge 2025-05-07T20:30:29.2516072Z tomli-2.2.1 | pyhd8ed1ab_1 19 KB conda-forge 2025-05-07T20:30:29.2516454Z ------------------------------------------------------------ 2025-05-07T20:30:29.2516782Z Total: 428 KB 2025-05-07T20:30:29.2516994Z 2025-05-07T20:30:29.2517124Z The following NEW packages will be INSTALLED: 2025-05-07T20:30:29.2517335Z 2025-05-07T20:30:29.2517538Z colorama conda-forge/noarch::colorama-0.4.6-pyhd8ed1ab_1 2025-05-07T20:30:29.2518233Z exceptiongroup conda-forge/noarch::exceptiongroup-1.2.2-pyhd8ed1ab_1 2025-05-07T20:30:29.2518859Z expecttest conda-forge/noarch::expecttest-0.3.0-pyhd8ed1ab_0 2025-05-07T20:30:29.2519326Z iniconfig conda-forge/noarch::iniconfig-2.0.0-pyhd8ed1ab_1 2025-05-07T20:30:29.2519784Z packaging conda-forge/noarch::packaging-25.0-pyh29332c3_1 2025-05-07T20:30:29.2520228Z pluggy conda-forge/noarch::pluggy-1.5.0-pyhd8ed1ab_1 2025-05-07T20:30:29.2520653Z pytest conda-forge/noarch::pytest-8.3.5-pyhd8ed1ab_0 2025-05-07T20:30:29.2521066Z tomli conda-forge/noarch::tomli-2.2.1-pyhd8ed1ab_1 2025-05-07T20:30:29.2521313Z 2025-05-07T20:30:29.2521317Z 2025-05-07T20:30:29.2521321Z 2025-05-07T20:30:29.2521465Z Downloading and Extracting Packages: ...working... 
2025-05-07T20:30:29.3124038Z pluggy-1.5.0 | 23 KB | ########## | 100% 2025-05-07T20:30:29.3179213Z pytest-8.3.5 | 254 KB | ########## | 100% 2025-05-07T20:30:29.3486317Z colorama-0.4.6 | 26 KB | ########## | 100% 2025-05-07T20:30:29.3512558Z packaging-25.0 | 61 KB | ########## | 100% 2025-05-07T20:30:29.3929580Z tomli-2.2.1 | 19 KB | ########## | 100% 2025-05-07T20:30:29.4124482Z expecttest-0.3.0 | 14 KB | ########## | 100% 2025-05-07T20:30:29.4221064Z exceptiongroup-1.2.2 | 20 KB | ########## | 100% 2025-05-07T20:30:29.4227935Z iniconfig-2.0.0 | 11 KB | ########## | 100% 2025-05-07T20:30:29.4232289Z done 2025-05-07T20:30:29.5235050Z Preparing transaction: \ done 2025-05-07T20:30:29.6242137Z Verifying transaction: / done 2025-05-07T20:30:31.5272010Z Executing transaction: \ | / - \ | / - \ | / - \ | / - \ | / done 2025-05-07T20:30:31.6579573Z [TEST] Checking imports ... 2025-05-07T20:30:35.5716294Z [CHECK] Python (sub-)package 'fbgemm_gpu' found ... 2025-05-07T20:30:35.5729883Z [TEST] Setting feature flags ...
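[NOTE] FBGEMM_TBE_ENSEMBLE_ROWWISE_ADAGRAD is read from the environment as a feature gate; persisting it on the conda env (next command) means every later `conda run -n build_binary` sees it without per-shell exports. A one-off alternative (a sketch) is to set it inline for a single test run:

    FBGEMM_TBE_ENSEMBLE_ROWWISE_ADAGRAD=1 conda run --no-capture-output -n build_binary \
        python -m pytest -v -rsx -s -W ignore::pytest.PytestCollectionWarning ./attention/gqa_test.py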
2025-05-07T20:30:35.5730391Z + conda env config vars set -n build_binary FBGEMM_TBE_ENSEMBLE_ROWWISE_ADAGRAD=1 2025-05-07T20:30:35.5730798Z 2025-05-07T20:30:35.9989262Z 2025-05-07T20:30:35.9989630Z [TEST] PyTest args: -v -rsx -s -W ignore::pytest.PytestCollectionWarning 2025-05-07T20:30:35.9991245Z ################################################################################ 2025-05-07T20:30:35.9991693Z # Run FBGEMM-GPU Tests: 2025-05-07T20:30:35.9991933Z # 2025-05-07T20:30:36.0011324Z # [2025-05-07T20:30:36.000Z] + __run_fbgemm_gpu_tests_in_directory build_binary 2025-05-07T20:30:36.0011893Z ################################################################################ 2025-05-07T20:30:36.0012178Z 2025-05-07T20:30:36.0019140Z [TEST] Enumerating ALL test files ... 2025-05-07T20:30:36.0049789Z ./attention/gqa_test.py 2025-05-07T20:30:36.0050197Z ./coalesce/coalesce_test.py 2025-05-07T20:30:36.0050544Z ./comm/multi_gpu_car_test.py 2025-05-07T20:30:36.0050911Z ./gather_scatter/gather_scatter_test.py 2025-05-07T20:30:36.0051260Z ./kv_cache/kv_cache_test.py 2025-05-07T20:30:36.0051526Z ./moe/activation_test.py 2025-05-07T20:30:36.0051813Z ./moe/gather_scatter_test.py 2025-05-07T20:30:36.0052063Z ./moe/layers_test.py 2025-05-07T20:30:36.0052289Z ./moe/shuffling_test.py 2025-05-07T20:30:36.0052528Z ./quantize/quantize_test.py 2025-05-07T20:30:36.0052695Z 2025-05-07T20:30:36.0052810Z [TEST] Enumerating IGNORED test files ... 2025-05-07T20:30:36.0062732Z 2025-05-07T20:30:36.0075809Z ################################################################################ 2025-05-07T20:30:36.0093425Z # [2025-05-07T20:30:36.009Z] Run Python Test Suite: 2025-05-07T20:30:36.0093992Z # ./attention/gqa_test.py 2025-05-07T20:30:36.0094360Z ################################################################################ 2025-05-07T20:30:36.0118688Z + conda run --no-capture-output -n build_binary python -m pytest -v -rsx -s -W ignore::pytest.PytestCollectionWarning --cache-clear ./attention/gqa_test.py 2025-05-07T20:30:36.0119411Z 2025-05-07T20:30:38.5749763Z ============================= test session starts ============================== 2025-05-07T20:30:38.5750802Z platform linux -- Python 3.13.0, pytest-8.3.5, pluggy-1.5.0 -- /home/ec2-user/miniconda/envs/build_binary/bin/python 2025-05-07T20:30:38.5751317Z cachedir: .pytest_cache 2025-05-07T20:30:38.5751898Z hypothesis profile 'ci' -> database=None, deadline=None, print_blob=True, derandomize=True, suppress_health_check=(HealthCheck.too_slow,) 2025-05-07T20:30:38.5752614Z rootdir: /home/ec2-user/actions-runner/_work/FBGEMM/FBGEMM/fbgemm_gpu 2025-05-07T20:30:38.5753025Z plugins: hypothesis-6.131.14 2025-05-07T20:30:40.1512733Z collecting ... 
collected 2 items 2025-05-07T20:30:40.1513248Z 2025-05-07T20:31:17.4167266Z attention/gqa_test.py::Int4GQATest::test_gqa Trying example: test_gqa( 2025-05-07T20:31:17.4168375Z self=, 2025-05-07T20:31:17.4168779Z int4_kv=False, 2025-05-07T20:31:17.4169036Z num_groups=1, 2025-05-07T20:31:17.4169289Z B=1, 2025-05-07T20:31:17.4169519Z MAX_T=4, 2025-05-07T20:31:17.4169763Z N_H_L=1, 2025-05-07T20:31:17.4170001Z ) 2025-05-07T20:31:17.4170248Z Trying example: test_gqa( 2025-05-07T20:31:17.4170597Z self=, 2025-05-07T20:31:17.4170972Z int4_kv=True, 2025-05-07T20:31:17.4171225Z num_groups=1, 2025-05-07T20:31:17.4171506Z B=1, 2025-05-07T20:31:17.4171724Z MAX_T=4, 2025-05-07T20:31:17.4171963Z N_H_L=1, 2025-05-07T20:31:17.4172209Z ) 2025-05-07T20:31:17.4172458Z Trying example: test_gqa( 2025-05-07T20:31:17.4172823Z self=, 2025-05-07T20:31:17.4173224Z int4_kv=True, 2025-05-07T20:31:17.4173481Z num_groups=4, 2025-05-07T20:31:17.4173899Z B=23, 2025-05-07T20:31:17.4174118Z MAX_T=33, 2025-05-07T20:31:17.4174359Z N_H_L=68, 2025-05-07T20:31:17.4174589Z ) 2025-05-07T20:31:17.4174812Z Trying example: test_gqa( 2025-05-07T20:31:17.4175154Z self=, 2025-05-07T20:31:17.4175527Z int4_kv=True, 2025-05-07T20:31:17.4175774Z num_groups=4, 2025-05-07T20:31:17.4176029Z B=77, 2025-05-07T20:31:17.4176251Z MAX_T=4, 2025-05-07T20:31:17.4176475Z N_H_L=1, 2025-05-07T20:31:17.4176700Z ) 2025-05-07T20:31:17.4176932Z Trying example: test_gqa( 2025-05-07T20:31:17.4177270Z self=, 2025-05-07T20:31:17.4177643Z int4_kv=True, 2025-05-07T20:31:17.4177902Z num_groups=4, 2025-05-07T20:31:17.4178140Z B=77, 2025-05-07T20:31:17.4178370Z MAX_T=52, 2025-05-07T20:31:17.4178601Z N_H_L=67, 2025-05-07T20:31:17.4178822Z ) 2025-05-07T20:31:17.4179053Z Trying example: test_gqa( 2025-05-07T20:31:17.4179396Z self=, 2025-05-07T20:31:17.4179794Z int4_kv=False, 2025-05-07T20:31:17.4180081Z num_groups=4, 2025-05-07T20:31:17.4180482Z B=57, 2025-05-07T20:31:17.4180710Z MAX_T=45, 2025-05-07T20:31:17.4180940Z N_H_L=120, 2025-05-07T20:31:17.4181173Z ) 2025-05-07T20:31:17.4181403Z Trying example: test_gqa( 2025-05-07T20:31:17.4181743Z self=, 2025-05-07T20:31:17.4182115Z int4_kv=True, 2025-05-07T20:31:17.4182364Z num_groups=4, 2025-05-07T20:31:17.4182600Z B=52, 2025-05-07T20:31:17.4182823Z MAX_T=42, 2025-05-07T20:31:17.4183054Z N_H_L=53, 2025-05-07T20:31:17.4183275Z ) 2025-05-07T20:31:17.4183502Z Trying example: test_gqa( 2025-05-07T20:31:17.4183851Z self=, 2025-05-07T20:31:17.4184215Z int4_kv=True, 2025-05-07T20:31:17.4184464Z num_groups=1, 2025-05-07T20:31:17.4184708Z B=77, 2025-05-07T20:31:17.4184925Z MAX_T=95, 2025-05-07T20:31:17.4185156Z N_H_L=53, 2025-05-07T20:31:17.4185387Z ) 2025-05-07T20:31:17.4185610Z Trying example: test_gqa( 2025-05-07T20:31:17.4185958Z self=, 2025-05-07T20:31:17.4186330Z int4_kv=True, 2025-05-07T20:31:17.4186570Z num_groups=4, 2025-05-07T20:31:17.4186813Z B=113, 2025-05-07T20:31:17.4187040Z MAX_T=48, 2025-05-07T20:31:17.4187498Z N_H_L=96, 2025-05-07T20:31:17.4187726Z ) 2025-05-07T20:31:17.4187954Z Trying example: test_gqa( 2025-05-07T20:31:17.4188296Z self=, 2025-05-07T20:31:17.4188660Z int4_kv=False, 2025-05-07T20:31:17.4188911Z num_groups=1, 2025-05-07T20:31:17.4189160Z B=51, 2025-05-07T20:31:17.4189383Z MAX_T=61, 2025-05-07T20:31:17.4189617Z N_H_L=69, 2025-05-07T20:31:17.4189847Z ) 2025-05-07T20:31:17.4190069Z Trying example: test_gqa( 2025-05-07T20:31:17.4190414Z self=, 2025-05-07T20:31:17.4190788Z int4_kv=False, 2025-05-07T20:31:17.4191032Z num_groups=4, 2025-05-07T20:31:17.4191279Z B=17, 2025-05-07T20:31:17.4191508Z MAX_T=113, 
2025-05-07T20:31:17.4191839Z N_H_L=65, 2025-05-07T20:31:17.4192072Z ) 2025-05-07T20:31:17.4192301Z Trying example: test_gqa( 2025-05-07T20:31:17.4192637Z self=, 2025-05-07T20:31:17.4193013Z int4_kv=False, 2025-05-07T20:31:17.4193274Z num_groups=4, 2025-05-07T20:31:17.4193513Z B=17, 2025-05-07T20:31:17.4193738Z MAX_T=65, 2025-05-07T20:31:17.4193970Z N_H_L=65, 2025-05-07T20:31:17.4194191Z ) 2025-05-07T20:31:17.4194422Z Trying example: test_gqa( 2025-05-07T20:31:17.4194764Z self=, 2025-05-07T20:31:17.4195129Z int4_kv=False, 2025-05-07T20:31:17.4195383Z num_groups=4, 2025-05-07T20:31:17.4195628Z B=65, 2025-05-07T20:31:17.4195850Z MAX_T=65, 2025-05-07T20:31:17.4196077Z N_H_L=65, 2025-05-07T20:31:17.4196307Z ) 2025-05-07T20:31:17.4196535Z Trying example: test_gqa( 2025-05-07T20:31:17.4196869Z self=, 2025-05-07T20:31:17.4197243Z int4_kv=False, 2025-05-07T20:31:17.4197501Z num_groups=1, 2025-05-07T20:31:17.4197741Z B=6, 2025-05-07T20:31:17.4197972Z MAX_T=108, 2025-05-07T20:31:17.4198453Z N_H_L=14, 2025-05-07T20:31:17.4198683Z ) 2025-05-07T20:31:17.4198915Z Trying example: test_gqa( 2025-05-07T20:31:17.4199263Z self=, 2025-05-07T20:31:17.4199633Z int4_kv=False, 2025-05-07T20:31:17.4199884Z num_groups=1, 2025-05-07T20:31:17.4200126Z B=6, 2025-05-07T20:31:17.4200340Z MAX_T=14, 2025-05-07T20:31:17.4200572Z N_H_L=14, 2025-05-07T20:31:17.4200800Z ) 2025-05-07T20:31:17.4201022Z Trying example: test_gqa( 2025-05-07T20:31:17.4201365Z self=, 2025-05-07T20:31:17.4201737Z int4_kv=False, 2025-05-07T20:31:17.4201981Z num_groups=1, 2025-05-07T20:31:17.4202227Z B=6, 2025-05-07T20:31:17.4202451Z MAX_T=6, 2025-05-07T20:31:17.4202677Z N_H_L=14, 2025-05-07T20:31:17.4202931Z ) 2025-05-07T20:31:17.4203189Z Trying example: test_gqa( 2025-05-07T20:31:17.4203526Z self=, 2025-05-07T20:31:17.4203900Z int4_kv=False, 2025-05-07T20:31:17.4204151Z num_groups=1, 2025-05-07T20:31:17.4204388Z B=6, 2025-05-07T20:31:17.4204616Z MAX_T=6, 2025-05-07T20:31:17.4204844Z N_H_L=6, 2025-05-07T20:31:17.4205071Z ) 2025-05-07T20:31:17.4205295Z Trying example: test_gqa( 2025-05-07T20:31:17.4205637Z self=, 2025-05-07T20:31:17.4206013Z int4_kv=False, 2025-05-07T20:31:17.4206256Z num_groups=1, 2025-05-07T20:31:17.4206502Z B=70, 2025-05-07T20:31:17.4206729Z MAX_T=94, 2025-05-07T20:31:17.4206958Z N_H_L=78, 2025-05-07T20:31:17.4207184Z ) 2025-05-07T20:31:17.4207414Z Trying example: test_gqa( 2025-05-07T20:31:17.4207748Z self=, 2025-05-07T20:31:17.4208119Z int4_kv=False, 2025-05-07T20:31:17.4208372Z num_groups=1, 2025-05-07T20:31:17.4208608Z B=78, 2025-05-07T20:31:17.4208835Z MAX_T=94, 2025-05-07T20:31:17.4209068Z N_H_L=78, 2025-05-07T20:31:17.4209287Z ) 2025-05-07T20:31:17.4209517Z Trying example: test_gqa( 2025-05-07T20:31:17.4209860Z self=, 2025-05-07T20:31:17.4210378Z int4_kv=False, 2025-05-07T20:31:17.4210630Z num_groups=1, 2025-05-07T20:31:17.4210875Z B=94, 2025-05-07T20:31:17.4211094Z MAX_T=94, 2025-05-07T20:31:17.4211325Z N_H_L=78, 2025-05-07T20:31:17.4211557Z ) 2025-05-07T20:31:17.4211783Z Trying example: test_gqa( 2025-05-07T20:31:17.4212123Z self=, 2025-05-07T20:31:17.4212499Z int4_kv=False, 2025-05-07T20:31:17.4212747Z num_groups=1, 2025-05-07T20:31:17.4213033Z B=94, 2025-05-07T20:31:17.4213267Z MAX_T=94, 2025-05-07T20:31:17.4213493Z N_H_L=94, 2025-05-07T20:31:17.4213817Z ) 2025-05-07T20:31:17.4214047Z Trying example: test_gqa( 2025-05-07T20:31:17.4214389Z self=, 2025-05-07T20:31:17.4214904Z int4_kv=False, 2025-05-07T20:31:17.4215164Z num_groups=4, 2025-05-07T20:31:17.4215410Z B=41, 2025-05-07T20:31:17.4215632Z MAX_T=105, 
2025-05-07T20:31:17.4215871Z N_H_L=126, 2025-05-07T20:31:17.4216112Z ) 2025-05-07T20:31:17.4216345Z Trying example: test_gqa( 2025-05-07T20:31:17.4216692Z self=, 2025-05-07T20:31:17.4217088Z int4_kv=False, 2025-05-07T20:31:17.4217364Z num_groups=4, 2025-05-07T20:31:17.4217645Z B=105, 2025-05-07T20:31:17.4217901Z MAX_T=105, 2025-05-07T20:31:17.4218162Z N_H_L=126, 2025-05-07T20:31:17.4218414Z ) 2025-05-07T20:31:17.4218662Z Trying example: test_gqa( 2025-05-07T20:31:17.4219024Z self=, 2025-05-07T20:31:17.4219436Z int4_kv=False, 2025-05-07T20:31:17.4219649Z num_groups=4, 2025-05-07T20:31:17.4219856Z B=105, 2025-05-07T20:31:17.4220050Z MAX_T=105, 2025-05-07T20:31:17.4220254Z N_H_L=105, 2025-05-07T20:31:17.4220456Z ) 2025-05-07T20:31:17.4220655Z Trying example: test_gqa( 2025-05-07T20:31:17.4220949Z self=, 2025-05-07T20:31:17.4221257Z int4_kv=True, 2025-05-07T20:31:17.4221465Z num_groups=1, 2025-05-07T20:31:17.4221678Z B=95, 2025-05-07T20:31:17.4221876Z MAX_T=114, 2025-05-07T20:31:17.4222071Z N_H_L=43, 2025-05-07T20:31:17.4222272Z ) 2025-05-07T20:31:17.4222471Z Trying example: test_gqa( 2025-05-07T20:31:17.4222755Z self=, 2025-05-07T20:31:17.4223063Z int4_kv=True, 2025-05-07T20:31:17.4223277Z num_groups=1, 2025-05-07T20:31:17.4223479Z B=43, 2025-05-07T20:31:17.4223675Z MAX_T=114, 2025-05-07T20:31:17.4223878Z N_H_L=43, 2025-05-07T20:31:17.4224068Z ) 2025-05-07T20:31:17.4224263Z Trying example: test_gqa( 2025-05-07T20:31:17.4224553Z self=, 2025-05-07T20:31:17.4224851Z int4_kv=True, 2025-05-07T20:31:17.4225062Z num_groups=1, 2025-05-07T20:31:17.4225274Z B=43, 2025-05-07T20:31:17.4225464Z MAX_T=43, 2025-05-07T20:31:17.4225658Z N_H_L=43, 2025-05-07T20:31:17.4225853Z ) 2025-05-07T20:31:17.4226049Z Trying example: test_gqa( 2025-05-07T20:31:17.4226336Z self=, 2025-05-07T20:31:17.4226652Z int4_kv=False, 2025-05-07T20:31:17.4226866Z num_groups=1, 2025-05-07T20:31:17.4227067Z B=21, 2025-05-07T20:31:17.4227263Z MAX_T=38, 2025-05-07T20:31:17.4227461Z N_H_L=42, 2025-05-07T20:31:17.4227649Z ) 2025-05-07T20:31:17.4227844Z Trying example: test_gqa( 2025-05-07T20:31:17.4228140Z self=, 2025-05-07T20:31:17.4228444Z int4_kv=False, 2025-05-07T20:31:17.4228659Z num_groups=1, 2025-05-07T20:31:17.4228863Z B=38, 2025-05-07T20:31:17.4229045Z MAX_T=38, 2025-05-07T20:31:17.4229244Z N_H_L=42, 2025-05-07T20:31:17.4229440Z ) 2025-05-07T20:31:17.4229630Z Trying example: test_gqa( 2025-05-07T20:31:17.4229923Z self=, 2025-05-07T20:31:17.4230232Z int4_kv=False, 2025-05-07T20:31:17.4230440Z num_groups=1, 2025-05-07T20:31:17.4230651Z B=38, 2025-05-07T20:31:17.4230847Z MAX_T=42, 2025-05-07T20:31:17.4231041Z N_H_L=42, 2025-05-07T20:31:17.4231856Z ) 2025-05-07T20:31:17.4232058Z Trying example: test_gqa( 2025-05-07T20:31:17.4232340Z self=, 2025-05-07T20:31:17.4232654Z int4_kv=False, 2025-05-07T20:31:17.4232868Z num_groups=1, 2025-05-07T20:31:17.4233078Z B=42, 2025-05-07T20:31:17.4233265Z MAX_T=42, 2025-05-07T20:31:17.4233468Z N_H_L=42, 2025-05-07T20:31:17.4233664Z ) 2025-05-07T20:31:17.4233857Z Trying example: test_gqa( 2025-05-07T20:31:17.4234147Z self=, 2025-05-07T20:31:17.4234461Z int4_kv=True, 2025-05-07T20:31:17.4234669Z num_groups=1, 2025-05-07T20:31:17.4234877Z B=74, 2025-05-07T20:31:17.4235070Z MAX_T=20, 2025-05-07T20:31:17.4235262Z N_H_L=15, 2025-05-07T20:31:17.4235550Z ) 2025-05-07T20:31:17.4235744Z Trying example: test_gqa( 2025-05-07T20:31:17.4236027Z self=, 2025-05-07T20:31:17.4236334Z int4_kv=True, 2025-05-07T20:31:17.4236546Z num_groups=1, 2025-05-07T20:31:17.4236751Z B=20, 2025-05-07T20:31:17.4236939Z MAX_T=20, 
2025-05-07T20:31:17.4237133Z N_H_L=15, 2025-05-07T20:31:17.4237320Z ) 2025-05-07T20:31:17.4237516Z Trying example: test_gqa( 2025-05-07T20:31:17.4237805Z self=, 2025-05-07T20:31:17.4238109Z int4_kv=True, 2025-05-07T20:31:17.4238317Z num_groups=1, 2025-05-07T20:31:17.4238524Z B=20, 2025-05-07T20:31:17.4238708Z MAX_T=15, 2025-05-07T20:31:17.4238907Z N_H_L=15, 2025-05-07T20:31:17.4239100Z ) 2025-05-07T20:31:17.4239285Z Trying example: test_gqa( 2025-05-07T20:31:17.4239568Z self=, 2025-05-07T20:31:17.4239874Z int4_kv=True, 2025-05-07T20:31:17.4240080Z num_groups=1, 2025-05-07T20:31:17.4240290Z B=15, 2025-05-07T20:31:17.4240482Z MAX_T=20, 2025-05-07T20:31:17.4240681Z N_H_L=15, 2025-05-07T20:31:17.4240866Z ) 2025-05-07T20:31:17.4241061Z Trying example: test_gqa( 2025-05-07T20:31:17.4241353Z self=, 2025-05-07T20:31:17.4241659Z int4_kv=True, 2025-05-07T20:31:17.4241868Z num_groups=1, 2025-05-07T20:31:17.4242073Z B=15, 2025-05-07T20:31:17.4242259Z MAX_T=15, 2025-05-07T20:31:17.4242462Z N_H_L=15, 2025-05-07T20:31:17.4242654Z ) 2025-05-07T20:31:17.4242846Z Trying example: test_gqa( 2025-05-07T20:31:17.4243132Z self=, 2025-05-07T20:31:17.4243444Z int4_kv=False, 2025-05-07T20:31:17.4243651Z num_groups=4, 2025-05-07T20:31:17.4243856Z B=117, 2025-05-07T20:31:17.4244052Z MAX_T=104, 2025-05-07T20:31:17.4244246Z N_H_L=69, 2025-05-07T20:31:17.4244441Z ) 2025-05-07T20:31:17.4244640Z Trying example: test_gqa( 2025-05-07T20:31:17.4244926Z self=, 2025-05-07T20:31:17.4245236Z int4_kv=False, 2025-05-07T20:31:17.4245449Z num_groups=4, 2025-05-07T20:31:17.4245646Z B=117, 2025-05-07T20:31:17.4245839Z MAX_T=117, 2025-05-07T20:31:17.4246046Z N_H_L=69, 2025-05-07T20:31:17.4246229Z ) 2025-05-07T20:31:17.4246420Z Trying example: test_gqa( 2025-05-07T20:31:17.4246709Z self=, 2025-05-07T20:31:17.4247012Z int4_kv=False, 2025-05-07T20:31:17.4247227Z num_groups=4, 2025-05-07T20:31:17.4247435Z B=69, 2025-05-07T20:31:17.4247626Z MAX_T=117, 2025-05-07T20:31:17.4247816Z N_H_L=69, 2025-05-07T20:31:17.4248008Z ) 2025-05-07T20:31:17.4248200Z Trying example: test_gqa( 2025-05-07T20:31:17.4248478Z self=, 2025-05-07T20:31:17.4248785Z int4_kv=False, 2025-05-07T20:31:17.4248997Z num_groups=4, 2025-05-07T20:31:17.4249197Z B=117, 2025-05-07T20:31:17.4249390Z MAX_T=69, 2025-05-07T20:31:17.4249589Z N_H_L=69, 2025-05-07T20:31:17.4249772Z ) 2025-05-07T20:31:17.4249959Z PASSED 2025-05-07T20:31:17.4355883Z attention/gqa_test.py::Int4GQATest::test_mqa_main SKIPPED (Skip when...) 
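The "Trying example: test_gqa(...)" lines above are Hypothesis printing each generated input at Verbosity.verbose, and every pytest session header in this log reports the profile in force: hypothesis profile 'ci' -> database=None, deadline=None, print_blob=True, derandomize=True, suppress_health_check=(HealthCheck.too_slow,). As a minimal sketch of how such a profile is typically registered (assuming a conftest.py-style setup; FBGEMM's actual configuration may differ):

    # Minimal sketch of the 'ci' Hypothesis profile shown in the session headers.
    # derandomize=True makes example generation deterministic across runs; the
    # per-test @settings(verbosity=Verbosity.verbose) seen later in this log is
    # what emits the "Trying example: ..." lines.
    from hypothesis import HealthCheck, settings

    settings.register_profile(
        "ci",
        database=None,                                  # no example database on CI
        deadline=None,                                  # no per-example time limit
        print_blob=True,                                # print reproduction blobs on failure
        derandomize=True,                               # deterministic generation
        suppress_health_check=(HealthCheck.too_slow,),  # tolerate slow GPU examples
    )
    settings.load_profile("ci")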
2025-05-07T20:31:17.4356210Z 2025-05-07T20:31:17.4356566Z =========================== short test summary info ============================ 2025-05-07T20:31:17.4357271Z SKIPPED [1] ../../../../../../../../miniconda/envs/build_binary/lib/python3.13/unittest/case.py:154: Skip when CUDA is not available or xformers is not available 2025-05-07T20:31:17.4357957Z ======================== 1 passed, 1 skipped in 39.38s ========================= 2025-05-07T20:31:18.0894981Z 2025-05-07T20:31:18.0895491Z [TEST] Python test suite PASSED: ./attention/gqa_test.py 2025-05-07T20:31:18.0915061Z [TEST] Python test time for ./attention/gqa_test.py: 42 seconds 2025-05-07T20:31:18.0915340Z 2025-05-07T20:31:18.0915351Z 2025-05-07T20:31:18.0915562Z 2025-05-07T20:31:18.0915574Z 2025-05-07T20:31:18.0936746Z ################################################################################ 2025-05-07T20:31:18.0951497Z # [2025-05-07T20:31:18.094Z] Run Python Test Suite: 2025-05-07T20:31:18.0951838Z # ./coalesce/coalesce_test.py 2025-05-07T20:31:18.0952125Z ################################################################################ 2025-05-07T20:31:18.0977842Z + conda run --no-capture-output -n build_binary python -m pytest -v -rsx -s -W ignore::pytest.PytestCollectionWarning --cache-clear ./coalesce/coalesce_test.py 2025-05-07T20:31:18.0978443Z 2025-05-07T20:31:20.2666165Z ============================= test session starts ============================== 2025-05-07T20:31:20.2666796Z platform linux -- Python 3.13.0, pytest-8.3.5, pluggy-1.5.0 -- /home/ec2-user/miniconda/envs/build_binary/bin/python 2025-05-07T20:31:20.2667308Z cachedir: .pytest_cache 2025-05-07T20:31:20.2667875Z hypothesis profile 'ci' -> database=None, deadline=None, print_blob=True, derandomize=True, suppress_health_check=(HealthCheck.too_slow,) 2025-05-07T20:31:20.2668616Z rootdir: /home/ec2-user/actions-runner/_work/FBGEMM/FBGEMM/fbgemm_gpu 2025-05-07T20:31:20.2669022Z plugins: hypothesis-6.131.14 2025-05-07T20:31:21.8292613Z collecting ... 
collected 1 item 2025-05-07T20:31:21.8293145Z 2025-05-07T20:31:22.5853866Z coalesce/coalesce_test.py::CoalesceTest::test_coalesce_batches PASSED 2025-05-07T20:31:22.5854195Z 2025-05-07T20:31:22.5854762Z ============================== 1 passed in 2.45s =============================== 2025-05-07T20:31:23.2147350Z 2025-05-07T20:31:23.2147838Z [TEST] Python test suite PASSED: ./coalesce/coalesce_test.py 2025-05-07T20:31:23.2168569Z [TEST] Python test time for ./coalesce/coalesce_test.py: 5 seconds 2025-05-07T20:31:23.2168853Z 2025-05-07T20:31:23.2168858Z 2025-05-07T20:31:23.2168861Z 2025-05-07T20:31:23.2168865Z 2025-05-07T20:31:23.2189812Z ################################################################################ 2025-05-07T20:31:23.2205707Z # [2025-05-07T20:31:23.220Z] Run Python Test Suite: 2025-05-07T20:31:23.2206072Z # ./comm/multi_gpu_car_test.py 2025-05-07T20:31:23.2206367Z ################################################################################ 2025-05-07T20:31:23.2232569Z + conda run --no-capture-output -n build_binary python -m pytest -v -rsx -s -W ignore::pytest.PytestCollectionWarning --cache-clear ./comm/multi_gpu_car_test.py 2025-05-07T20:31:23.2233197Z 2025-05-07T20:31:25.3950925Z ============================= test session starts ============================== 2025-05-07T20:31:25.3951571Z platform linux -- Python 3.13.0, pytest-8.3.5, pluggy-1.5.0 -- /home/ec2-user/miniconda/envs/build_binary/bin/python 2025-05-07T20:31:25.3952089Z cachedir: .pytest_cache 2025-05-07T20:31:25.3952654Z hypothesis profile 'ci' -> database=None, deadline=None, print_blob=True, derandomize=True, suppress_health_check=(HealthCheck.too_slow,) 2025-05-07T20:31:25.3953367Z rootdir: /home/ec2-user/actions-runner/_work/FBGEMM/FBGEMM/fbgemm_gpu 2025-05-07T20:31:25.3953780Z plugins: hypothesis-6.131.14 2025-05-07T20:31:26.9954491Z collecting ... 
collected 5 items 2025-05-07T20:31:26.9954701Z 2025-05-07T20:31:26.9965341Z comm/multi_gpu_car_test.py::LLamaMultiGpuTests::test_allgather SKIPPED 2025-05-07T20:31:26.9972587Z comm/multi_gpu_car_test.py::LLamaMultiGpuTests::test_allgather_dtype_mismatch SKIPPED 2025-05-07T20:31:26.9979353Z comm/multi_gpu_car_test.py::LLamaMultiGpuTests::test_allreduce SKIPPED 2025-05-07T20:31:26.9989883Z comm/multi_gpu_car_test.py::LLamaMultiGpuTests::test_oneshot_car_stress SKIPPED 2025-05-07T20:31:27.0005132Z comm/multi_gpu_car_test.py::LLamaMultiGpuTests::test_reducescatter SKIPPED 2025-05-07T20:31:27.0005478Z 2025-05-07T20:31:27.0005625Z =========================== short test summary info ============================ 2025-05-07T20:31:27.0006292Z SKIPPED [1] comm/multi_gpu_car_test.py:310: Skip when CUDA is not available or when there are not enough GPUs; these tests require at least two GPUs 2025-05-07T20:31:27.0007397Z SKIPPED [1] comm/multi_gpu_car_test.py:351: Skip when CUDA is not available or when there are not enough GPUs; these tests require at least two GPUs 2025-05-07T20:31:27.0008304Z SKIPPED [1] comm/multi_gpu_car_test.py:418: Skip when CUDA is not available or when there are not enough GPUs; these tests require at least two GPUs 2025-05-07T20:31:27.0009198Z SKIPPED [1] comm/multi_gpu_car_test.py:434: Skip when CUDA is not available or when there are not enough GPUs; these tests require at least two GPUs 2025-05-07T20:31:27.0010100Z SKIPPED [1] comm/multi_gpu_car_test.py:402: Skip when CUDA is not available or when there are not enough GPUs; these tests require at least two GPUs 2025-05-07T20:31:27.0010736Z ============================== 5 skipped in 1.74s ============================== 2025-05-07T20:31:27.5728782Z 2025-05-07T20:31:27.5729467Z [TEST] Python test suite PASSED: ./comm/multi_gpu_car_test.py 2025-05-07T20:31:27.5749035Z [TEST] Python test time for ./comm/multi_gpu_car_test.py: 4 seconds 2025-05-07T20:31:27.5749343Z 2025-05-07T20:31:27.5749348Z 2025-05-07T20:31:27.5749352Z 2025-05-07T20:31:27.5749387Z 2025-05-07T20:31:27.5770237Z ################################################################################ 2025-05-07T20:31:27.5785748Z # [2025-05-07T20:31:27.578Z] Run Python Test Suite: 2025-05-07T20:31:27.5786091Z # ./gather_scatter/gather_scatter_test.py 2025-05-07T20:31:27.5786421Z ################################################################################ 2025-05-07T20:31:27.5811802Z + conda run --no-capture-output -n build_binary python -m pytest -v -rsx -s -W ignore::pytest.PytestCollectionWarning --cache-clear ./gather_scatter/gather_scatter_test.py 2025-05-07T20:31:27.5812453Z 2025-05-07T20:31:29.7524188Z ============================= test session starts ============================== 2025-05-07T20:31:29.7524984Z platform linux -- Python 3.13.0, pytest-8.3.5, pluggy-1.5.0 -- /home/ec2-user/miniconda/envs/build_binary/bin/python 2025-05-07T20:31:29.7525503Z cachedir: .pytest_cache 2025-05-07T20:31:29.7526094Z hypothesis profile 'ci' -> database=None, deadline=None, print_blob=True, derandomize=True, suppress_health_check=(HealthCheck.too_slow,) 2025-05-07T20:31:29.7526857Z rootdir: /home/ec2-user/actions-runner/_work/FBGEMM/FBGEMM/fbgemm_gpu 2025-05-07T20:31:29.7527261Z plugins: hypothesis-6.131.14 2025-05-07T20:31:31.4060986Z collecting ... 
collected 2 items 2025-05-07T20:31:31.4061276Z 2025-05-07T20:31:31.4072093Z gather_scatter/gather_scatter_test.py::GatherScatterTests::test_gather_along_first_dim SKIPPED 2025-05-07T20:31:31.4086844Z gather_scatter/gather_scatter_test.py::GatherScatterTests::test_scatter_add_along_first_dim SKIPPED 2025-05-07T20:31:31.4087464Z 2025-05-07T20:31:31.4087674Z =========================== short test summary info ============================ 2025-05-07T20:31:31.4088291Z SKIPPED [1] gather_scatter/gather_scatter_test.py:29: Skip when no Hopper GPU is available. This test is only for Hopper GPU. 2025-05-07T20:31:31.4089103Z SKIPPED [1] gather_scatter/gather_scatter_test.py:70: Skip when no Hopper GPU is available. This test is only for Hopper GPU. 2025-05-07T20:31:31.4089709Z ============================== 2 skipped in 1.79s ============================== 2025-05-07T20:31:31.9992510Z 2025-05-07T20:31:31.9993080Z [TEST] Python test suite PASSED: ./gather_scatter/gather_scatter_test.py 2025-05-07T20:31:32.0014360Z [TEST] Python test time for ./gather_scatter/gather_scatter_test.py: 5 seconds 2025-05-07T20:31:32.0014681Z 2025-05-07T20:31:32.0014686Z 2025-05-07T20:31:32.0014690Z 2025-05-07T20:31:32.0014694Z 2025-05-07T20:31:32.0035304Z ################################################################################ 2025-05-07T20:31:32.0050442Z # [2025-05-07T20:31:32.004Z] Run Python Test Suite: 2025-05-07T20:31:32.0050769Z # ./kv_cache/kv_cache_test.py 2025-05-07T20:31:32.0051057Z ################################################################################ 2025-05-07T20:31:32.0076430Z + conda run --no-capture-output -n build_binary python -m pytest -v -rsx -s -W ignore::pytest.PytestCollectionWarning --cache-clear ./kv_cache/kv_cache_test.py 2025-05-07T20:31:32.0077245Z 2025-05-07T20:31:34.1808374Z ============================= test session starts ============================== 2025-05-07T20:31:34.1809039Z platform linux -- Python 3.13.0, pytest-8.3.5, pluggy-1.5.0 -- /home/ec2-user/miniconda/envs/build_binary/bin/python 2025-05-07T20:31:34.1809586Z cachedir: .pytest_cache 2025-05-07T20:31:34.1810152Z hypothesis profile 'ci' -> database=None, deadline=None, print_blob=True, derandomize=True, suppress_health_check=(HealthCheck.too_slow,) 2025-05-07T20:31:34.1810859Z rootdir: /home/ec2-user/actions-runner/_work/FBGEMM/FBGEMM/fbgemm_gpu 2025-05-07T20:31:34.1811258Z plugins: hypothesis-6.131.14 2025-05-07T20:31:35.7577869Z collecting ... collected 4 items 2025-05-07T20:31:35.7578076Z 2025-05-07T20:31:38.0279881Z kv_cache/kv_cache_test.py::KVCacheTests::test_fp8_kv_cache SKIPPED (...) 
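The gather_scatter suite above is skipped with "Skip when no Hopper GPU is available. This test is only for Hopper GPU.", and the kv_cache fp8 test above is likewise gated on H100 or MI300 hardware, as its skip reason in the summary below shows. A minimal sketch of that kind of hardware guard (an assumed helper for illustration, not FBGEMM's actual code):

    # Assumed sketch of a Hopper-only skip guard; Hopper-class GPUs (H100)
    # report CUDA compute capability (9, 0).
    import unittest

    import torch

    def has_hopper_gpu() -> bool:
        return (
            torch.cuda.is_available()
            and torch.cuda.get_device_capability() >= (9, 0)
        )

    @unittest.skipIf(
        not has_hopper_gpu(),
        "Skip when no Hopper GPU is available. This test is only for Hopper GPU.",
    )
    class GatherScatterTests(unittest.TestCase):
        ...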
2025-05-07T20:31:38.0362960Z kv_cache/kv_cache_test.py::KVCacheTests::test_int4_kv_cache SKIPPED 2025-05-07T20:31:38.0452601Z kv_cache/kv_cache_test.py::KVCacheTests::test_positional_encoding_with_paged_attention SKIPPED 2025-05-07T20:31:38.0538838Z kv_cache/kv_cache_test.py::KVCacheTests::test_rope_positional_encoding_only SKIPPED 2025-05-07T20:31:38.0539202Z 2025-05-07T20:31:38.0539352Z =========================== short test summary info ============================ 2025-05-07T20:31:38.0540053Z SKIPPED [1] ../../../../../../../../miniconda/envs/build_binary/lib/python3.13/unittest/case.py:154: Skip when H100 is not available or MI300 is not available 2025-05-07T20:31:38.0540937Z SKIPPED [3] ../../../../../../../../miniconda/envs/build_binary/lib/python3.13/unittest/case.py:154: Skip when xformers is not available 2025-05-07T20:31:38.0541535Z ============================== 4 skipped in 4.01s ============================== 2025-05-07T20:31:40.2816660Z 2025-05-07T20:31:40.2817359Z [TEST] Python test suite PASSED: ./kv_cache/kv_cache_test.py 2025-05-07T20:31:40.2837422Z [TEST] Python test time for ./kv_cache/kv_cache_test.py: 8 seconds 2025-05-07T20:31:40.2837707Z 2025-05-07T20:31:40.2837739Z 2025-05-07T20:31:40.2837743Z 2025-05-07T20:31:40.2837747Z 2025-05-07T20:31:40.2858375Z ################################################################################ 2025-05-07T20:31:40.2873760Z # [2025-05-07T20:31:40.287Z] Run Python Test Suite: 2025-05-07T20:31:40.2874107Z # ./moe/activation_test.py 2025-05-07T20:31:40.2874390Z ################################################################################ 2025-05-07T20:31:40.2899805Z + conda run --no-capture-output -n build_binary python -m pytest -v -rsx -s -W ignore::pytest.PytestCollectionWarning --cache-clear ./moe/activation_test.py 2025-05-07T20:31:40.2900382Z 2025-05-07T20:31:42.4639644Z ============================= test session starts ============================== 2025-05-07T20:31:42.4640579Z platform linux -- Python 3.13.0, pytest-8.3.5, pluggy-1.5.0 -- /home/ec2-user/miniconda/envs/build_binary/bin/python 2025-05-07T20:31:42.4641094Z cachedir: .pytest_cache 2025-05-07T20:31:42.4641688Z hypothesis profile 'ci' -> database=None, deadline=None, print_blob=True, derandomize=True, suppress_health_check=(HealthCheck.too_slow,) 2025-05-07T20:31:42.4642400Z rootdir: /home/ec2-user/actions-runner/_work/FBGEMM/FBGEMM/fbgemm_gpu 2025-05-07T20:31:42.4642811Z plugins: hypothesis-6.131.14 2025-05-07T20:31:44.0529274Z TMA benchmarks will be running with experimental grid constant TMA descriptor. 2025-05-07T20:31:44.1497056Z collecting ... 
collected 2 items 2025-05-07T20:31:44.1497318Z 2025-05-07T20:31:49.0934227Z moe/activation_test.py::ActivationTests::test_silu_mul Trying example: test_silu_mul( 2025-05-07T20:31:49.0935102Z self=, 2025-05-07T20:31:49.0935611Z T=1, 2025-05-07T20:31:49.0935843Z D=5120, 2025-05-07T20:31:49.0936094Z contiguous=True, 2025-05-07T20:31:49.0936379Z compiled=True, 2025-05-07T20:31:49.0936641Z ) 2025-05-07T20:31:49.0936897Z Trying example: test_silu_mul( 2025-05-07T20:31:49.0937710Z self=, 2025-05-07T20:31:49.0938087Z T=4096, 2025-05-07T20:31:49.0938275Z D=5120, 2025-05-07T20:31:49.0938469Z contiguous=True, 2025-05-07T20:31:49.0938693Z compiled=True, 2025-05-07T20:31:49.0938900Z ) 2025-05-07T20:31:49.0939114Z Trying example: test_silu_mul( 2025-05-07T20:31:49.0939474Z self=, 2025-05-07T20:31:49.0939847Z T=4096, 2025-05-07T20:31:49.0940035Z D=7168, 2025-05-07T20:31:49.0940227Z contiguous=False, 2025-05-07T20:31:49.0940448Z compiled=False, 2025-05-07T20:31:49.0940652Z ) 2025-05-07T20:31:49.0940851Z Trying example: test_silu_mul( 2025-05-07T20:31:49.0941212Z self=, 2025-05-07T20:31:49.0941585Z T=4096, 2025-05-07T20:31:49.0941781Z D=5120, 2025-05-07T20:31:49.0941972Z contiguous=False, 2025-05-07T20:31:49.0942195Z compiled=True, 2025-05-07T20:31:49.0942396Z ) 2025-05-07T20:31:49.0942594Z Trying example: test_silu_mul( 2025-05-07T20:31:49.0942959Z self=, 2025-05-07T20:31:49.0943332Z T=1, 2025-05-07T20:31:49.0943509Z D=7168, 2025-05-07T20:31:49.0943708Z contiguous=True, 2025-05-07T20:31:49.0943941Z compiled=True, 2025-05-07T20:31:49.0944141Z ) 2025-05-07T20:31:49.0944337Z Trying example: test_silu_mul( 2025-05-07T20:31:49.0944699Z self=, 2025-05-07T20:31:49.0945068Z T=1, 2025-05-07T20:31:49.0945256Z D=7168, 2025-05-07T20:31:49.0945454Z contiguous=False, 2025-05-07T20:31:49.0945678Z compiled=True, 2025-05-07T20:31:49.0945877Z ) 2025-05-07T20:31:49.0946081Z Trying example: test_silu_mul( 2025-05-07T20:31:49.0946448Z self=, 2025-05-07T20:31:49.0946811Z T=4096, 2025-05-07T20:31:49.0947002Z D=5120, 2025-05-07T20:31:49.0947198Z contiguous=False, 2025-05-07T20:31:49.0947422Z compiled=False, 2025-05-07T20:31:49.0947629Z ) 2025-05-07T20:31:49.0947828Z Trying example: test_silu_mul( 2025-05-07T20:31:49.0948189Z self=, 2025-05-07T20:31:49.0948571Z T=1, 2025-05-07T20:31:49.0948760Z D=7168, 2025-05-07T20:31:49.0948948Z contiguous=True, 2025-05-07T20:31:49.0949170Z compiled=False, 2025-05-07T20:31:49.0949375Z ) 2025-05-07T20:31:49.0949567Z Trying example: test_silu_mul( 2025-05-07T20:31:49.0949933Z self=, 2025-05-07T20:31:49.0950305Z T=2048, 2025-05-07T20:31:49.0950482Z D=5120, 2025-05-07T20:31:49.0950674Z contiguous=True, 2025-05-07T20:31:49.0950897Z compiled=True, 2025-05-07T20:31:49.0951089Z ) 2025-05-07T20:31:49.0951282Z Trying example: test_silu_mul( 2025-05-07T20:31:49.0951645Z self=, 2025-05-07T20:31:49.0952014Z T=2048, 2025-05-07T20:31:49.0952197Z D=7168, 2025-05-07T20:31:49.0952396Z contiguous=True, 2025-05-07T20:31:49.0952617Z compiled=True, 2025-05-07T20:31:49.0952810Z ) 2025-05-07T20:31:49.0953006Z Trying example: test_silu_mul( 2025-05-07T20:31:49.0953371Z self=, 2025-05-07T20:31:49.0953918Z T=2048, 2025-05-07T20:31:49.0954106Z D=7168, 2025-05-07T20:31:49.0954297Z contiguous=True, 2025-05-07T20:31:49.0954516Z compiled=False, 2025-05-07T20:31:49.0954722Z ) 2025-05-07T20:31:49.0954921Z Trying example: test_silu_mul( 2025-05-07T20:31:49.0955279Z self=, 2025-05-07T20:31:49.0955651Z T=128, 2025-05-07T20:31:49.0955835Z D=5120, 2025-05-07T20:31:49.0956026Z contiguous=False, 2025-05-07T20:31:49.0956249Z 
compiled=True, 2025-05-07T20:31:49.0956452Z ) 2025-05-07T20:31:49.0956643Z Trying example: test_silu_mul( 2025-05-07T20:31:49.0957109Z self=, 2025-05-07T20:31:49.0957484Z T=128, 2025-05-07T20:31:49.0957669Z D=5120, 2025-05-07T20:31:49.0957865Z contiguous=True, 2025-05-07T20:31:49.0958089Z compiled=True, 2025-05-07T20:31:49.0958298Z ) 2025-05-07T20:31:49.0958490Z Trying example: test_silu_mul( 2025-05-07T20:31:49.0958857Z self=, 2025-05-07T20:31:49.0959230Z T=16384, 2025-05-07T20:31:49.0959419Z D=5120, 2025-05-07T20:31:49.0959618Z contiguous=False, 2025-05-07T20:31:49.0959845Z compiled=True, 2025-05-07T20:31:49.0960041Z ) 2025-05-07T20:31:49.0960238Z Trying example: test_silu_mul( 2025-05-07T20:31:49.0960606Z self=, 2025-05-07T20:31:49.0960973Z T=16384, 2025-05-07T20:31:49.0961165Z D=5120, 2025-05-07T20:31:49.0961364Z contiguous=False, 2025-05-07T20:31:49.0961580Z compiled=False, 2025-05-07T20:31:49.0961785Z ) 2025-05-07T20:31:49.0961988Z Trying example: test_silu_mul( 2025-05-07T20:31:49.0962345Z self=, 2025-05-07T20:31:49.0962719Z T=128, 2025-05-07T20:31:49.0962907Z D=7168, 2025-05-07T20:31:49.0963105Z contiguous=True, 2025-05-07T20:31:49.0963324Z compiled=False, 2025-05-07T20:31:49.0963525Z ) 2025-05-07T20:31:49.0963733Z Trying example: test_silu_mul( 2025-05-07T20:31:49.0964092Z self=, 2025-05-07T20:31:49.0964464Z T=128, 2025-05-07T20:31:49.0964653Z D=7168, 2025-05-07T20:31:49.0964852Z contiguous=False, 2025-05-07T20:31:49.0965072Z compiled=False, 2025-05-07T20:31:49.0965283Z ) 2025-05-07T20:31:49.0965484Z Trying example: test_silu_mul( 2025-05-07T20:31:49.0965841Z self=, 2025-05-07T20:31:49.0966218Z T=1, 2025-05-07T20:31:49.0966403Z D=5120, 2025-05-07T20:31:49.0966595Z contiguous=False, 2025-05-07T20:31:49.0966831Z compiled=False, 2025-05-07T20:31:49.0967037Z ) 2025-05-07T20:31:49.0967226Z Trying example: test_silu_mul( 2025-05-07T20:31:49.0967594Z self=, 2025-05-07T20:31:49.0967970Z T=1, 2025-05-07T20:31:49.0968149Z D=7168, 2025-05-07T20:31:49.0968340Z contiguous=False, 2025-05-07T20:31:49.0968564Z compiled=False, 2025-05-07T20:31:49.0968763Z ) 2025-05-07T20:31:49.0968960Z Trying example: test_silu_mul( 2025-05-07T20:31:49.0969331Z self=, 2025-05-07T20:31:49.0969694Z T=4096, 2025-05-07T20:31:49.0969883Z D=5120, 2025-05-07T20:31:49.0970080Z contiguous=True, 2025-05-07T20:31:49.0970294Z compiled=False, 2025-05-07T20:31:49.0970501Z ) 2025-05-07T20:31:49.0970696Z Trying example: test_silu_mul( 2025-05-07T20:31:49.0971061Z self=, 2025-05-07T20:31:49.0971427Z T=128, 2025-05-07T20:31:49.0971624Z D=7168, 2025-05-07T20:31:49.0971825Z contiguous=True, 2025-05-07T20:31:49.0972039Z compiled=True, 2025-05-07T20:31:49.0972245Z ) 2025-05-07T20:31:49.0972444Z Trying example: test_silu_mul( 2025-05-07T20:31:49.0972900Z self=, 2025-05-07T20:31:49.0973272Z T=1, 2025-05-07T20:31:49.0973482Z D=5120, 2025-05-07T20:31:49.0973837Z contiguous=False, 2025-05-07T20:31:49.0974077Z compiled=True, 2025-05-07T20:31:49.0974283Z ) 2025-05-07T20:31:49.0974474Z Trying example: test_silu_mul( 2025-05-07T20:31:49.0974845Z self=, 2025-05-07T20:31:49.0975216Z T=4096, 2025-05-07T20:31:49.0975396Z D=7168, 2025-05-07T20:31:49.0975592Z contiguous=True, 2025-05-07T20:31:49.0975813Z compiled=False, 2025-05-07T20:31:49.0976012Z ) 2025-05-07T20:31:49.0976213Z Trying example: test_silu_mul( 2025-05-07T20:31:49.0976689Z self=, 2025-05-07T20:31:49.0977060Z T=4096, 2025-05-07T20:31:49.0977252Z D=7168, 2025-05-07T20:31:49.0977449Z contiguous=False, 2025-05-07T20:31:49.0977676Z compiled=True, 2025-05-07T20:31:49.0977882Z ) 
2025-05-07T20:31:49.0978081Z Trying example: test_silu_mul( 2025-05-07T20:31:49.0978444Z self=, 2025-05-07T20:31:49.0978812Z T=128, 2025-05-07T20:31:49.0979003Z D=5120, 2025-05-07T20:31:49.0979199Z contiguous=True, 2025-05-07T20:31:49.0979418Z compiled=False, 2025-05-07T20:31:49.0979624Z ) 2025-05-07T20:31:49.0979822Z Trying example: test_silu_mul( 2025-05-07T20:31:49.0980181Z self=, 2025-05-07T20:31:49.0980554Z T=128, 2025-05-07T20:31:49.0980741Z D=5120, 2025-05-07T20:31:49.0980930Z contiguous=False, 2025-05-07T20:31:49.0981153Z compiled=False, 2025-05-07T20:31:49.0981361Z ) 2025-05-07T20:31:49.0981558Z Trying example: test_silu_mul( 2025-05-07T20:31:49.0981921Z self=, 2025-05-07T20:31:49.0982292Z T=1, 2025-05-07T20:31:49.0982470Z D=5120, 2025-05-07T20:31:49.0982674Z contiguous=True, 2025-05-07T20:31:49.0982895Z compiled=False, 2025-05-07T20:31:49.0983094Z ) 2025-05-07T20:31:49.0983295Z Trying example: test_silu_mul( 2025-05-07T20:31:49.0983659Z self=, 2025-05-07T20:31:49.0984028Z T=2048, 2025-05-07T20:31:49.0984212Z D=7168, 2025-05-07T20:31:49.0984409Z contiguous=False, 2025-05-07T20:31:49.0984635Z compiled=True, 2025-05-07T20:31:49.0984829Z ) 2025-05-07T20:31:49.0985026Z Trying example: test_silu_mul( 2025-05-07T20:31:49.0985389Z self=, 2025-05-07T20:31:49.0985750Z T=2048, 2025-05-07T20:31:49.0985938Z D=7168, 2025-05-07T20:31:49.0986130Z contiguous=False, 2025-05-07T20:31:49.0986352Z compiled=False, 2025-05-07T20:31:49.0986559Z ) 2025-05-07T20:31:49.0986759Z Trying example: test_silu_mul( 2025-05-07T20:31:49.0987118Z self=, 2025-05-07T20:31:49.0987493Z T=16384, 2025-05-07T20:31:49.0987690Z D=7168, 2025-05-07T20:31:49.0987879Z contiguous=False, 2025-05-07T20:31:49.0988106Z compiled=True, 2025-05-07T20:31:49.0988310Z ) 2025-05-07T20:31:49.0988503Z Trying example: test_silu_mul( 2025-05-07T20:31:49.0988873Z self=, 2025-05-07T20:31:49.0989246Z T=16384, 2025-05-07T20:31:49.0989437Z D=7168, 2025-05-07T20:31:49.0989626Z contiguous=True, 2025-05-07T20:31:49.0989850Z compiled=True, 2025-05-07T20:31:49.0990054Z ) 2025-05-07T20:31:49.0990247Z Trying example: test_silu_mul( 2025-05-07T20:31:49.0990618Z self=, 2025-05-07T20:31:49.0990990Z T=4096, 2025-05-07T20:31:49.0991177Z D=7168, 2025-05-07T20:31:49.0991370Z contiguous=True, 2025-05-07T20:31:49.0991592Z compiled=True, 2025-05-07T20:31:49.0991787Z ) 2025-05-07T20:31:49.0991987Z Trying example: test_silu_mul( 2025-05-07T20:31:49.0992448Z self=, 2025-05-07T20:31:49.0992811Z T=2048, 2025-05-07T20:31:49.0993002Z D=5120, 2025-05-07T20:31:49.0993198Z contiguous=False, 2025-05-07T20:31:49.0993418Z compiled=False, 2025-05-07T20:31:49.0993626Z ) 2025-05-07T20:31:49.0993823Z Trying example: test_silu_mul( 2025-05-07T20:31:49.0994181Z self=, 2025-05-07T20:31:49.0994549Z T=2048, 2025-05-07T20:31:49.0994739Z D=5120, 2025-05-07T20:31:49.0994928Z contiguous=True, 2025-05-07T20:31:49.0995148Z compiled=False, 2025-05-07T20:31:49.0995355Z ) 2025-05-07T20:31:49.0995544Z Trying example: test_silu_mul( 2025-05-07T20:31:49.0996005Z self=, 2025-05-07T20:31:49.0996380Z T=128, 2025-05-07T20:31:49.0996571Z D=7168, 2025-05-07T20:31:49.0996757Z contiguous=False, 2025-05-07T20:31:49.0996979Z compiled=True, 2025-05-07T20:31:49.0997188Z ) 2025-05-07T20:31:49.0997375Z Trying example: test_silu_mul( 2025-05-07T20:31:49.0997740Z self=, 2025-05-07T20:31:49.0998108Z T=16384, 2025-05-07T20:31:49.0998752Z D=5120, 2025-05-07T20:31:49.0998950Z contiguous=True, 2025-05-07T20:31:49.0999173Z compiled=True, 2025-05-07T20:31:49.0999369Z ) 2025-05-07T20:31:49.0999568Z Trying example: 
test_silu_mul( 2025-05-07T20:31:49.0999933Z self=, 2025-05-07T20:31:49.1000301Z T=2048, 2025-05-07T20:31:49.1000489Z D=5120, 2025-05-07T20:31:49.1000699Z contiguous=False, 2025-05-07T20:31:49.1000925Z compiled=True, 2025-05-07T20:31:49.1001132Z ) 2025-05-07T20:31:49.1001331Z Trying example: test_silu_mul( 2025-05-07T20:31:49.1001705Z self=, 2025-05-07T20:31:49.1002082Z T=16384, 2025-05-07T20:31:49.1002271Z D=5120, 2025-05-07T20:31:49.1002475Z contiguous=True, 2025-05-07T20:31:49.1002699Z compiled=False, 2025-05-07T20:31:49.1002902Z ) 2025-05-07T20:31:49.1011800Z Trying example: test_silu_mul( 2025-05-07T20:31:49.1012219Z self=, 2025-05-07T20:31:49.1012620Z T=16384, 2025-05-07T20:31:49.1012823Z D=7168, 2025-05-07T20:31:49.1013030Z contiguous=False, 2025-05-07T20:31:49.1013256Z compiled=False, 2025-05-07T20:31:49.1013469Z ) 2025-05-07T20:31:49.1013779Z Trying example: test_silu_mul( 2025-05-07T20:31:49.1014149Z self=, 2025-05-07T20:31:49.1014530Z T=16384, 2025-05-07T20:31:49.1014728Z D=7168, 2025-05-07T20:31:49.1014930Z contiguous=True, 2025-05-07T20:31:49.1015157Z compiled=False, 2025-05-07T20:31:49.1015366Z ) 2025-05-07T20:31:49.1015554Z PASSED 2025-05-07T20:31:49.1633808Z W0507 20:31:49.161000 275344 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/0] Encountered an exception in identify_mutated_tensors, assuming every input is mutated 2025-05-07T20:31:49.1634948Z W0507 20:31:49.161000 275344 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/0] Traceback (most recent call last): 2025-05-07T20:31:49.1636292Z W0507 20:31:49.161000 275344 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/0] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/torch/_higher_order_ops/triton_kernel_wrap.py", line 731, in identify_mutated_tensors 2025-05-07T20:31:49.1637732Z W0507 20:31:49.161000 275344 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/0] ttir_module, ordered_tensor_names = generate_ttir(kernel, kwargs) 2025-05-07T20:31:49.1638718Z W0507 20:31:49.161000 275344 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/0] ~~~~~~~~~~~~~^^^^^^^^^^^^^^^^ 2025-05-07T20:31:49.1640021Z W0507 20:31:49.161000 275344 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/0] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/torch/_higher_order_ops/triton_kernel_wrap.py", line 356, in generate_ttir 2025-05-07T20:31:49.1641786Z W0507 20:31:49.161000 275344 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/0] ttir_module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:31:49.1643174Z W0507 20:31:49.161000 275344 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/0] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py", line 100, in make_ir 2025-05-07T20:31:49.1644718Z W0507 20:31:49.161000 275344 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/0] return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:31:49.1645768Z W0507 20:31:49.161000 275344 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/0] module_map=module_map) 2025-05-07T20:31:49.1647014Z W0507 20:31:49.161000 275344 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/0] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/code_generator.py", line 1298, 
in ast_to_ttir 2025-05-07T20:31:49.1648249Z W0507 20:31:49.161000 275344 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/0] generator.visit(fn.parse()) 2025-05-07T20:31:49.1649093Z W0507 20:31:49.161000 275344 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/0] ~~~~~~~~~~~~~~~^^^^^^^^^^^^ 2025-05-07T20:31:49.1650290Z W0507 20:31:49.161000 275344 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/0] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/code_generator.py", line 1201, in visit 2025-05-07T20:31:49.1651483Z W0507 20:31:49.161000 275344 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/0] ret = super().visit(node) 2025-05-07T20:31:49.1652517Z W0507 20:31:49.161000 275344 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/0] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.13/ast.py", line 428, in visit 2025-05-07T20:31:49.1653535Z W0507 20:31:49.161000 275344 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/0] return visitor(node) 2025-05-07T20:31:49.1654845Z W0507 20:31:49.161000 275344 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/0] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/code_generator.py", line 352, in visit_Module 2025-05-07T20:31:49.1656124Z W0507 20:31:49.161000 275344 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/0] ast.NodeVisitor.generic_visit(self, node) 2025-05-07T20:31:49.1657019Z W0507 20:31:49.161000 275344 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/0] ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~^^^^^^^^^^^^ 2025-05-07T20:31:49.1658099Z W0507 20:31:49.161000 275344 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/0] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.13/ast.py", line 436, in generic_visit 2025-05-07T20:31:49.1659135Z W0507 20:31:49.161000 275344 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/0] self.visit(item) 2025-05-07T20:31:49.1659907Z W0507 20:31:49.161000 275344 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/0] ~~~~~~~~~~^^^^^^ 2025-05-07T20:31:49.1661073Z W0507 20:31:49.161000 275344 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/0] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/code_generator.py", line 1207, in visit 2025-05-07T20:31:49.1662402Z W0507 20:31:49.161000 275344 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/0] raise CompilationError(self.jit_fn.src, self.cur_node, repr(e)) from None 2025-05-07T20:31:49.1663551Z W0507 20:31:49.161000 275344 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/0] triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:31:49.1664453Z W0507 20:31:49.161000 275344 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/0] def _fbgemm_silu_mul_quant( 2025-05-07T20:31:49.1665196Z W0507 20:31:49.161000 275344 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/0] ^ 2025-05-07T20:31:49.1666212Z W0507 20:31:49.161000 275344 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/0] ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:31:49.6290104Z moe/activation_test.py::ActivationTests::test_silu_mul_quant Trying example: test_silu_mul_quant( 2025-05-07T20:31:49.6291903Z self=, 2025-05-07T20:31:49.6292434Z T=1, 2025-05-07T20:31:49.6292626Z D=5120, 2025-05-07T20:31:49.6292849Z scale_ub=None, 2025-05-07T20:31:49.6293149Z contiguous=True, 2025-05-07T20:31:49.6293460Z compiled=True, 2025-05-07T20:31:49.6293861Z ) 2025-05-07T20:31:49.6294324Z self = 2025-05-07T20:31:49.6295015Z T = 1, D = 5120, scale_ub = None, contiguous = True, compiled = True 2025-05-07T20:31:49.6295276Z 2025-05-07T20:31:49.6295360Z @given( 2025-05-07T20:31:49.6295598Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:31:49.6295919Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:31:49.6296235Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:31:49.6296569Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:31:49.6296897Z compiled=st.sampled_from([True, False]), 2025-05-07T20:31:49.6297188Z ) 2025-05-07T20:31:49.6297568Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:31:49.6298007Z def test_silu_mul_quant( 2025-05-07T20:31:49.6298561Z self, 2025-05-07T20:31:49.6298769Z T: int, 2025-05-07T20:31:49.6298965Z D: int, 2025-05-07T20:31:49.6299192Z scale_ub: Optional[float], 2025-05-07T20:31:49.6299471Z contiguous: bool, 2025-05-07T20:31:49.6299721Z compiled: bool, 2025-05-07T20:31:49.6299963Z ) -> None: 2025-05-07T20:31:49.6300189Z torch.manual_seed(2025) 2025-05-07T20:31:49.6300434Z 2025-05-07T20:31:49.6300704Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:31:49.6301453Z 2025-05-07T20:31:49.6301657Z x_sign = torch.sign(x) 2025-05-07T20:31:49.6301947Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:31:49.6302263Z x = x_sign * x_clamp 2025-05-07T20:31:49.6302507Z x0 = x[:, :D] 2025-05-07T20:31:49.6302722Z x1 = x[:, D:] 2025-05-07T20:31:49.6302936Z 2025-05-07T20:31:49.6303128Z if contiguous: 2025-05-07T20:31:49.6303357Z x0 = x0.contiguous() 2025-05-07T20:31:49.6303620Z x1 = x1.contiguous() 2025-05-07T20:31:49.6303864Z 2025-05-07T20:31:49.6304051Z if scale_ub is not None: 2025-05-07T20:31:49.6304329Z scale_ub_tensor = torch.tensor( 2025-05-07T20:31:49.6304804Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:31:49.6305121Z ) 2025-05-07T20:31:49.6305313Z else: 2025-05-07T20:31:49.6305527Z scale_ub_tensor = None 2025-05-07T20:31:49.6305781Z 2025-05-07T20:31:49.6306009Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:31:49.6306338Z op = silu_mul_quant 2025-05-07T20:31:49.6306593Z if compiled: 2025-05-07T20:31:49.6306835Z op = torch.compile(op) 2025-05-07T20:31:49.6307136Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:31:49.6307412Z 2025-05-07T20:31:49.6307602Z y_fp8, y_scale = fn() 2025-05-07T20:31:49.6307886Z y = y_fp8.to(torch.float32) * y_scale[:, None] 2025-05-07T20:31:49.6308180Z 2025-05-07T20:31:49.6308412Z def ref_fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:31:49.6308749Z x0_fp32 = x0.to(torch.float32) 2025-05-07T20:31:49.6309039Z x1_fp32 = x1.to(torch.float32) 2025-05-07T20:31:49.6309356Z y = x0_fp32 * torch.sigmoid(x0_fp32) * x1_fp32 2025-05-07T20:31:49.6309716Z return triton_quantize_fp8_row(y, scale_ub_tensor) 2025-05-07T20:31:49.6310033Z 2025-05-07T20:31:49.6310238Z > y_fp8_ref, y_scale_ref = ref_fn() 2025-05-07T20:31:49.6310439Z 2025-05-07T20:31:49.6310540Z moe/activation_test.py:126: 2025-05-07T20:31:49.6310839Z _ _ _ _ _ _ _ _ 
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:31:49.6311180Z moe/activation_test.py:124: in ref_fn 2025-05-07T20:31:49.6311501Z return triton_quantize_fp8_row(y, scale_ub_tensor) 2025-05-07T20:31:49.6312287Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/fbgemm_gpu/experimental/gemm/triton_gemm/fp8_gemm.py:2370: in triton_quantize_fp8_row 2025-05-07T20:31:49.6313038Z _kernel_quantize_fp8_row[grid]( 2025-05-07T20:31:49.6313582Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:31:49.6314261Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:31:49.6314945Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/autotuner.py:186: in run 2025-05-07T20:31:49.6315664Z timings = {config: self._bench(*args, config=config, **kwargs) for config in pruned_configs} 2025-05-07T20:31:49.6316383Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/autotuner.py:166: in _bench 2025-05-07T20:31:49.6317018Z return self.do_bench(kernel_call, quantiles=(0.5, 0.2, 0.8)) 2025-05-07T20:31:49.6317618Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/testing.py:117: in do_bench 2025-05-07T20:31:49.6318129Z fn() 2025-05-07T20:31:49.6318633Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/autotuner.py:152: in kernel_call 2025-05-07T20:31:49.6319213Z self.fn.run( 2025-05-07T20:31:49.6319677Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:31:49.6320211Z kernel = self.compile( 2025-05-07T20:31:49.6320753Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:31:49.6321494Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:31:49.6321889Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:31:49.6322125Z 2025-05-07T20:31:49.6322331Z self = 2025-05-07T20:31:49.6323405Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=2, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:31:49.6324866Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f93bd68d3a0>} 2025-05-07T20:31:49.6326199Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:31:49.6327211Z context = 2025-05-07T20:31:49.6327503Z 2025-05-07T20:31:49.6327672Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:31:49.6328189Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:31:49.6328653Z module_map=module_map) 2025-05-07T20:31:49.6329019Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:31:49.6329383Z E def _kernel_quantize_fp8_row( 2025-05-07T20:31:49.6329651Z E ^ 2025-05-07T20:31:49.6330110Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:31:49.6330558Z 2025-05-07T20:31:49.6330967Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:31:49.6331484Z 2025-05-07T20:31:49.6331586Z Trying example: test_silu_mul_quant( 2025-05-07T20:31:49.6331999Z self=, 2025-05-07T20:31:49.6332392Z T=2048, 2025-05-07T20:31:49.6332581Z D=5120, 2025-05-07T20:31:49.6332775Z scale_ub=1200.0, 2025-05-07T20:31:49.6332993Z contiguous=True, 2025-05-07T20:31:49.6333218Z compiled=False, 2025-05-07T20:31:49.6333426Z ) 2025-05-07T20:31:49.6333820Z self = 2025-05-07T20:31:49.6334313Z T = 2048, D = 5120, scale_ub = 1200.0, contiguous = True, compiled = False 2025-05-07T20:31:49.6334581Z 2025-05-07T20:31:49.6334666Z @given( 2025-05-07T20:31:49.6334905Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:31:49.6335210Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:31:49.6335517Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:31:49.6335848Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:31:49.6336176Z compiled=st.sampled_from([True, False]), 2025-05-07T20:31:49.6336466Z ) 2025-05-07T20:31:49.6336815Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:31:49.6337246Z def test_silu_mul_quant( 2025-05-07T20:31:49.6337487Z self, 2025-05-07T20:31:49.6337683Z T: int, 2025-05-07T20:31:49.6337876Z D: int, 2025-05-07T20:31:49.6338096Z scale_ub: Optional[float], 2025-05-07T20:31:49.6338371Z contiguous: bool, 2025-05-07T20:31:49.6338609Z compiled: bool, 2025-05-07T20:31:49.6338832Z ) -> None: 2025-05-07T20:31:49.6339047Z torch.manual_seed(2025) 2025-05-07T20:31:49.6339284Z 2025-05-07T20:31:49.6339563Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:31:49.6339908Z 2025-05-07T20:31:49.6340102Z x_sign = torch.sign(x) 2025-05-07T20:31:49.6340387Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:31:49.6340790Z x = x_sign * x_clamp 2025-05-07T20:31:49.6341037Z x0 = x[:, :D] 2025-05-07T20:31:49.6341251Z x1 = x[:, D:] 2025-05-07T20:31:49.6341461Z 2025-05-07T20:31:49.6341652Z if contiguous: 2025-05-07T20:31:49.6341881Z x0 = x0.contiguous() 2025-05-07T20:31:49.6342142Z x1 = x1.contiguous() 2025-05-07T20:31:49.6342386Z 2025-05-07T20:31:49.6342576Z if scale_ub is not None: 2025-05-07T20:31:49.6342852Z scale_ub_tensor = torch.tensor( 2025-05-07T20:31:49.6343182Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:31:49.6343480Z ) 2025-05-07T20:31:49.6343675Z else: 2025-05-07T20:31:49.6343968Z scale_ub_tensor = None 2025-05-07T20:31:49.6344215Z 2025-05-07T20:31:49.6344442Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:31:49.6344754Z op = silu_mul_quant 2025-05-07T20:31:49.6344999Z if compiled: 2025-05-07T20:31:49.6345247Z op = torch.compile(op) 2025-05-07T20:31:49.6345545Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:31:49.6345817Z 2025-05-07T20:31:49.6346003Z > y_fp8, y_scale = fn() 2025-05-07T20:31:49.6346172Z 2025-05-07T20:31:49.6346269Z moe/activation_test.py:117: 2025-05-07T20:31:49.6346569Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:31:49.6346892Z moe/activation_test.py:115: in fn 2025-05-07T20:31:49.6347177Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:31:49.6347856Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:31:49.6348549Z 
_fbgemm_silu_mul_quant[grid]( 2025-05-07T20:31:49.6349076Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:31:49.6349749Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:31:49.6350410Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:31:49.6350934Z kernel = self.compile( 2025-05-07T20:31:49.6351471Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:31:49.6352154Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:31:49.6352573Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:31:49.6352799Z 2025-05-07T20:31:49.6353004Z self = 2025-05-07T20:31:49.6354072Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:31:49.6355417Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f93ce57e200>} 2025-05-07T20:31:49.6356745Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:31:49.6357752Z context = 2025-05-07T20:31:49.6358035Z 2025-05-07T20:31:49.6358201Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:31:49.6358718Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:31:49.6359189Z module_map=module_map) 2025-05-07T20:31:49.6359545Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:31:49.6359897Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:31:49.6360155Z E ^ 2025-05-07T20:31:49.6360704Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:31:49.6361146Z 2025-05-07T20:31:49.6361554Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:31:49.9001743Z W0507 20:31:49.895000 275344 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/1] Encountered an exception in identify_mutated_tensors, assuming every input is mutated 2025-05-07T20:31:49.9003682Z W0507 20:31:49.895000 275344 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/1] Traceback (most recent call last): 2025-05-07T20:31:49.9007901Z W0507 20:31:49.895000 275344 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/1] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/torch/_higher_order_ops/triton_kernel_wrap.py", line 731, in identify_mutated_tensors 2025-05-07T20:31:49.9010818Z W0507 20:31:49.895000 275344 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/1] ttir_module, ordered_tensor_names = generate_ttir(kernel, kwargs) 2025-05-07T20:31:49.9012358Z W0507 20:31:49.895000 275344 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/1] ~~~~~~~~~~~~~^^^^^^^^^^^^^^^^ 2025-05-07T20:31:49.9013731Z W0507 20:31:49.895000 275344 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/1] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/torch/_higher_order_ops/triton_kernel_wrap.py", line 356, in generate_ttir 2025-05-07T20:31:49.9015097Z W0507 20:31:49.895000 275344 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/1] ttir_module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:31:49.9016379Z W0507 20:31:49.895000 275344 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/1] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py", line 100, in make_ir 2025-05-07T20:31:49.9017735Z W0507 20:31:49.895000 275344 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/1] return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:31:49.9018762Z W0507 20:31:49.895000 275344 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/1] module_map=module_map) 2025-05-07T20:31:49.9019993Z W0507 20:31:49.895000 275344 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/1] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/code_generator.py", line 1298, in ast_to_ttir 2025-05-07T20:31:49.9021224Z W0507 20:31:49.895000 275344 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/1] generator.visit(fn.parse()) 2025-05-07T20:31:49.9022065Z W0507 20:31:49.895000 275344 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/1] ~~~~~~~~~~~~~~~^^^^^^^^^^^^ 2025-05-07T20:31:49.9023298Z W0507 20:31:49.895000 275344 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/1] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/code_generator.py", line 1201, in visit 2025-05-07T20:31:49.9024486Z W0507 20:31:49.895000 275344 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/1] ret = super().visit(node) 2025-05-07T20:31:49.9025494Z W0507 20:31:49.895000 275344 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/1] File 
"/home/ec2-user/miniconda/envs/build_binary/lib/python3.13/ast.py", line 428, in visit 2025-05-07T20:31:49.9026500Z W0507 20:31:49.895000 275344 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/1] return visitor(node) 2025-05-07T20:31:49.9027694Z W0507 20:31:49.895000 275344 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/1] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/code_generator.py", line 352, in visit_Module 2025-05-07T20:31:49.9029129Z W0507 20:31:49.895000 275344 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/1] ast.NodeVisitor.generic_visit(self, node) 2025-05-07T20:31:49.9030011Z W0507 20:31:49.895000 275344 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/1] ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~^^^^^^^^^^^^ 2025-05-07T20:31:49.9031073Z W0507 20:31:49.895000 275344 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/1] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.13/ast.py", line 436, in generic_visit 2025-05-07T20:31:49.9032175Z W0507 20:31:49.895000 275344 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/1] self.visit(item) 2025-05-07T20:31:49.9032940Z W0507 20:31:49.895000 275344 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/1] ~~~~~~~~~~^^^^^^ 2025-05-07T20:31:49.9034089Z W0507 20:31:49.895000 275344 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/1] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/code_generator.py", line 1207, in visit 2025-05-07T20:31:49.9035411Z W0507 20:31:49.895000 275344 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/1] raise CompilationError(self.jit_fn.src, self.cur_node, repr(e)) from None 2025-05-07T20:31:49.9036457Z W0507 20:31:49.895000 275344 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/1] triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:31:49.9037349Z W0507 20:31:49.895000 275344 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/1] def _fbgemm_silu_mul_quant( 2025-05-07T20:31:49.9038087Z W0507 20:31:49.895000 275344 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/1] ^ 2025-05-07T20:31:49.9039084Z W0507 20:31:49.895000 275344 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/1] ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:31:50.4952294Z 2025-05-07T20:31:50.4952847Z Trying example: test_silu_mul_quant( 2025-05-07T20:31:50.4953341Z self=, 2025-05-07T20:31:50.4953790Z T=2048, 2025-05-07T20:31:50.4953987Z D=5120, 2025-05-07T20:31:50.4954179Z scale_ub=1200.0, 2025-05-07T20:31:50.4954441Z contiguous=True, 2025-05-07T20:31:50.4954671Z compiled=True, 2025-05-07T20:31:50.4955086Z ) 2025-05-07T20:31:50.4955405Z self = 2025-05-07T20:31:50.4955897Z T = 2048, D = 5120, scale_ub = 1200.0, contiguous = True, compiled = True 2025-05-07T20:31:50.4956165Z 2025-05-07T20:31:50.4956248Z @given( 2025-05-07T20:31:50.4956472Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:31:50.4956789Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:31:50.4957093Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:31:50.4957412Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:31:50.4957741Z compiled=st.sampled_from([True, False]), 2025-05-07T20:31:50.4958028Z ) 2025-05-07T20:31:50.4958510Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:31:50.4958959Z def test_silu_mul_quant( 2025-05-07T20:31:50.4959202Z self, 2025-05-07T20:31:50.4959399Z T: int, 2025-05-07T20:31:50.4959591Z D: int, 2025-05-07T20:31:50.4959819Z scale_ub: Optional[float], 2025-05-07T20:31:50.4960094Z contiguous: bool, 2025-05-07T20:31:50.4960329Z compiled: bool, 2025-05-07T20:31:50.4960553Z ) -> None: 2025-05-07T20:31:50.4960775Z torch.manual_seed(2025) 2025-05-07T20:31:50.4961008Z 2025-05-07T20:31:50.4961273Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:31:50.4961619Z 2025-05-07T20:31:50.4961811Z x_sign = torch.sign(x) 2025-05-07T20:31:50.4962099Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:31:50.4962410Z x = x_sign * x_clamp 2025-05-07T20:31:50.4962648Z x0 = x[:, :D] 2025-05-07T20:31:50.4962866Z x1 = x[:, D:] 2025-05-07T20:31:50.4963075Z 2025-05-07T20:31:50.4963264Z if contiguous: 2025-05-07T20:31:50.4963495Z x0 = x0.contiguous() 2025-05-07T20:31:50.4963755Z x1 = x1.contiguous() 2025-05-07T20:31:50.4963986Z 2025-05-07T20:31:50.4964186Z if scale_ub is not None: 2025-05-07T20:31:50.4964460Z scale_ub_tensor = torch.tensor( 2025-05-07T20:31:50.4964796Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:31:50.4965099Z ) 2025-05-07T20:31:50.4965298Z else: 2025-05-07T20:31:50.4965518Z scale_ub_tensor = None 2025-05-07T20:31:50.4965761Z 2025-05-07T20:31:50.4965991Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:31:50.4966307Z op = silu_mul_quant 2025-05-07T20:31:50.4966548Z if compiled: 2025-05-07T20:31:50.4966796Z op = torch.compile(op) 2025-05-07T20:31:50.4967094Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:31:50.4967362Z 2025-05-07T20:31:50.4967562Z y_fp8, y_scale = fn() 2025-05-07T20:31:50.4967845Z y = y_fp8.to(torch.float32) * y_scale[:, None] 2025-05-07T20:31:50.4968128Z 2025-05-07T20:31:50.4968374Z def ref_fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:31:50.4968716Z x0_fp32 = x0.to(torch.float32) 2025-05-07T20:31:50.4969006Z x1_fp32 = x1.to(torch.float32) 2025-05-07T20:31:50.4969307Z y = x0_fp32 * torch.sigmoid(x0_fp32) * x1_fp32 2025-05-07T20:31:50.4969659Z return triton_quantize_fp8_row(y, scale_ub_tensor) 2025-05-07T20:31:50.4969966Z 2025-05-07T20:31:50.4970160Z > y_fp8_ref, y_scale_ref = ref_fn() 2025-05-07T20:31:50.4970356Z 2025-05-07T20:31:50.4970457Z moe/activation_test.py:126: 2025-05-07T20:31:50.4970751Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 
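In this example (compiled=True), fn() itself got past compilation, but the reference path fails next: ref_fn() calls triton_quantize_fp8_row, whose _kernel_quantize_fp8_row is also a Triton kernel that emits fp8e4nv, so it hits the same ValueError, as the traceback below shows. For anyone reproducing this locally, a pure-PyTorch rowwise quantizer can stand in for the reference path; the following is a minimal sketch assuming torch.float8_e4m3fn is available in the installed PyTorch (2.1+). It is not FBGEMM's triton_quantize_fp8_row implementation, and the function name is hypothetical:

import torch

def quantize_fp8_row_reference(
    y: torch.Tensor, scale_ub: torch.Tensor | None = None
) -> tuple[torch.Tensor, torch.Tensor]:
    # Hypothetical stand-in for triton_quantize_fp8_row, written in pure
    # PyTorch so it does not depend on Triton fp8e4nv codegen.
    fp8_max = torch.finfo(torch.float8_e4m3fn).max  # 448.0 for e4m3fn
    row_max = y.abs().amax(dim=-1, keepdim=True).to(torch.float32)
    if scale_ub is not None:
        # Mirror the test's scale_ub tensor: cap the per-row max.
        row_max = torch.minimum(row_max, scale_ub.to(row_max.device))
    row_max = torch.clamp(row_max, min=1e-12)  # avoid divide-by-zero rows
    y_scale = row_max / fp8_max
    y_fp8 = (y.to(torch.float32) / y_scale).clamp(-fp8_max, fp8_max)
    # The dtype cast is a software conversion, so it also runs on GPUs
    # without native fp8 support (unlike the Triton kernel in the log).
    return y_fp8.to(torch.float8_e4m3fn), y_scale.squeeze(-1)

Dequantizing with y_fp8.to(torch.float32) * y_scale[:, None], exactly as the test above does, approximately recovers y under this scheme.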
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:31:50.4971079Z moe/activation_test.py:124: in ref_fn 2025-05-07T20:31:50.4971402Z return triton_quantize_fp8_row(y, scale_ub_tensor) 2025-05-07T20:31:50.4972180Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/fbgemm_gpu/experimental/gemm/triton_gemm/fp8_gemm.py:2370: in triton_quantize_fp8_row 2025-05-07T20:31:50.4972974Z _kernel_quantize_fp8_row[grid]( 2025-05-07T20:31:50.4973701Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:31:50.4974379Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:31:50.4975058Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/autotuner.py:186: in run 2025-05-07T20:31:50.4975772Z timings = {config: self._bench(*args, config=config, **kwargs) for config in pruned_configs} 2025-05-07T20:31:50.4976480Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/autotuner.py:166: in _bench 2025-05-07T20:31:50.4977111Z return self.do_bench(kernel_call, quantiles=(0.5, 0.2, 0.8)) 2025-05-07T20:31:50.4977792Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/testing.py:117: in do_bench 2025-05-07T20:31:50.4978305Z fn() 2025-05-07T20:31:50.4978808Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/autotuner.py:152: in kernel_call 2025-05-07T20:31:50.4979391Z self.fn.run( 2025-05-07T20:31:50.4979854Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:31:50.4980372Z kernel = self.compile( 2025-05-07T20:31:50.4980912Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:31:50.4981564Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:31:50.4981954Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:31:50.4982186Z 2025-05-07T20:31:50.4982396Z self = 2025-05-07T20:31:50.4983516Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=2, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:31:50.4984864Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f93ce4bede0>} 2025-05-07T20:31:50.4986191Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:31:50.4987190Z context = 2025-05-07T20:31:50.4987482Z 2025-05-07T20:31:50.4987648Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:31:50.4988169Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:31:50.4988634Z module_map=module_map) 2025-05-07T20:31:50.4988996Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:31:50.4989355Z E def _kernel_quantize_fp8_row( 2025-05-07T20:31:50.4989628Z E ^ 2025-05-07T20:31:50.4990078Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:31:50.4990524Z 2025-05-07T20:31:50.4990933Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:31:50.4991444Z 2025-05-07T20:31:50.4991548Z Trying example: test_silu_mul_quant( 2025-05-07T20:31:50.4991958Z self=, 2025-05-07T20:31:50.4992355Z T=16384, 2025-05-07T20:31:50.4992550Z D=7168, 2025-05-07T20:31:50.4992768Z scale_ub=1200.0, 2025-05-07T20:31:50.4993000Z contiguous=False, 2025-05-07T20:31:50.4993217Z compiled=False, 2025-05-07T20:31:50.4993423Z ) 2025-05-07T20:31:50.4993736Z self = 2025-05-07T20:31:50.4994223Z T = 16384, D = 7168, scale_ub = 1200.0, contiguous = False, compiled = False 2025-05-07T20:31:50.4994588Z 2025-05-07T20:31:50.4994664Z @given( 2025-05-07T20:31:50.4994894Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:31:50.4995205Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:31:50.4995503Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:31:50.4995831Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:31:50.4996154Z compiled=st.sampled_from([True, False]), 2025-05-07T20:31:50.4996436Z ) 2025-05-07T20:31:50.4996783Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:31:50.4997220Z def test_silu_mul_quant( 2025-05-07T20:31:50.4997456Z self, 2025-05-07T20:31:50.4997734Z T: int, 2025-05-07T20:31:50.4997935Z D: int, 2025-05-07T20:31:50.4998147Z scale_ub: Optional[float], 2025-05-07T20:31:50.4998581Z contiguous: bool, 2025-05-07T20:31:50.4998827Z compiled: bool, 2025-05-07T20:31:50.4999043Z ) -> None: 2025-05-07T20:31:50.4999256Z torch.manual_seed(2025) 2025-05-07T20:31:50.4999498Z 2025-05-07T20:31:50.4999761Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:31:50.5000105Z 2025-05-07T20:31:50.5000297Z x_sign = torch.sign(x) 2025-05-07T20:31:50.5000589Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:31:50.5000895Z x = x_sign * x_clamp 2025-05-07T20:31:50.5001131Z x0 = x[:, :D] 2025-05-07T20:31:50.5001344Z x1 = x[:, D:] 2025-05-07T20:31:50.5001547Z 2025-05-07T20:31:50.5001736Z if contiguous: 2025-05-07T20:31:50.5001971Z x0 = x0.contiguous() 2025-05-07T20:31:50.5002239Z x1 = x1.contiguous() 2025-05-07T20:31:50.5002518Z 2025-05-07T20:31:50.5002723Z if scale_ub is not None: 2025-05-07T20:31:50.5002998Z scale_ub_tensor = torch.tensor( 2025-05-07T20:31:50.5003334Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:31:50.5003637Z ) 2025-05-07T20:31:50.5003830Z else: 2025-05-07T20:31:50.5004043Z scale_ub_tensor = None 2025-05-07T20:31:50.5004286Z 2025-05-07T20:31:50.5004519Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:31:50.5004834Z op = silu_mul_quant 2025-05-07T20:31:50.5005079Z if compiled: 2025-05-07T20:31:50.5005324Z op = torch.compile(op) 2025-05-07T20:31:50.5005628Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:31:50.5005897Z 2025-05-07T20:31:50.5006095Z > y_fp8, y_scale = fn() 2025-05-07T20:31:50.5006256Z 2025-05-07T20:31:50.5006358Z moe/activation_test.py:117: 2025-05-07T20:31:50.5006654Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:31:50.5006980Z moe/activation_test.py:115: in fn 2025-05-07T20:31:50.5007264Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:31:50.5007950Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 
2025-05-07T20:31:50.5008628Z _fbgemm_silu_mul_quant[grid]( 2025-05-07T20:31:50.5009163Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:31:50.5009839Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:31:50.5010491Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:31:50.5011015Z kernel = self.compile( 2025-05-07T20:31:50.5011553Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:31:50.5012206Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:31:50.5012631Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:31:50.5013010Z 2025-05-07T20:31:50.5013216Z self = 2025-05-07T20:31:50.5014355Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:31:50.5015708Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f93a73cd440>} 2025-05-07T20:31:50.5017190Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:31:50.5018194Z context = 2025-05-07T20:31:50.5018486Z 2025-05-07T20:31:50.5018650Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:31:50.5019169Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:31:50.5019631Z module_map=module_map) 2025-05-07T20:31:50.5019990Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:31:50.5020345Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:31:50.5020605Z E ^ 2025-05-07T20:31:50.5021059Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:31:50.5021504Z 2025-05-07T20:31:50.5021915Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:31:50.6802679Z W0507 20:31:50.676000 275344 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/2] Encountered an exception in identify_mutated_tensors, assuming every input is mutated 2025-05-07T20:31:50.6803809Z W0507 20:31:50.676000 275344 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/2] Traceback (most recent call last): 2025-05-07T20:31:50.6805490Z W0507 20:31:50.676000 275344 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/2] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/torch/_higher_order_ops/triton_kernel_wrap.py", line 731, in identify_mutated_tensors 2025-05-07T20:31:50.6807258Z W0507 20:31:50.676000 275344 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/2] ttir_module, ordered_tensor_names = generate_ttir(kernel, kwargs) 2025-05-07T20:31:50.6808232Z W0507 20:31:50.676000 275344 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/2] ~~~~~~~~~~~~~^^^^^^^^^^^^^^^^ 2025-05-07T20:31:50.6809521Z W0507 20:31:50.676000 275344 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/2] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/torch/_higher_order_ops/triton_kernel_wrap.py", line 356, in generate_ttir 2025-05-07T20:31:50.6810886Z W0507 20:31:50.676000 275344 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/2] ttir_module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:31:50.6812162Z W0507 20:31:50.676000 275344 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/2] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py", line 100, in make_ir 2025-05-07T20:31:50.6813517Z W0507 20:31:50.676000 275344 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/2] return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:31:50.6814918Z W0507 20:31:50.676000 275344 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/2] module_map=module_map) 2025-05-07T20:31:50.6816492Z W0507 20:31:50.676000 275344 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/2] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/code_generator.py", line 1298, in ast_to_ttir 2025-05-07T20:31:50.6817874Z W0507 20:31:50.676000 275344 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/2] generator.visit(fn.parse()) 2025-05-07T20:31:50.6818710Z W0507 20:31:50.676000 275344 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/2] ~~~~~~~~~~~~~~~^^^^^^^^^^^^ 2025-05-07T20:31:50.6819888Z W0507 20:31:50.676000 275344 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/2] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/code_generator.py", line 1201, in visit 2025-05-07T20:31:50.6821192Z W0507 20:31:50.676000 275344 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/2] ret = super().visit(node) 2025-05-07T20:31:50.6822217Z W0507 20:31:50.676000 275344 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/2] File 
"/home/ec2-user/miniconda/envs/build_binary/lib/python3.13/ast.py", line 428, in visit 2025-05-07T20:31:50.6823275Z W0507 20:31:50.676000 275344 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/2] return visitor(node) 2025-05-07T20:31:50.6824480Z W0507 20:31:50.676000 275344 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/2] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/code_generator.py", line 352, in visit_Module 2025-05-07T20:31:50.6825735Z W0507 20:31:50.676000 275344 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/2] ast.NodeVisitor.generic_visit(self, node) 2025-05-07T20:31:50.6826633Z W0507 20:31:50.676000 275344 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/2] ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~^^^^^^^^^^^^ 2025-05-07T20:31:50.6827707Z W0507 20:31:50.676000 275344 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/2] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.13/ast.py", line 436, in generic_visit 2025-05-07T20:31:50.6828739Z W0507 20:31:50.676000 275344 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/2] self.visit(item) 2025-05-07T20:31:50.6829500Z W0507 20:31:50.676000 275344 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/2] ~~~~~~~~~~^^^^^^ 2025-05-07T20:31:50.6830645Z W0507 20:31:50.676000 275344 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/2] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/code_generator.py", line 1207, in visit 2025-05-07T20:31:50.6831980Z W0507 20:31:50.676000 275344 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/2] raise CompilationError(self.jit_fn.src, self.cur_node, repr(e)) from None 2025-05-07T20:31:50.6833028Z W0507 20:31:50.676000 275344 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/2] triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:31:50.6833938Z W0507 20:31:50.676000 275344 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/2] def _fbgemm_silu_mul_quant( 2025-05-07T20:31:50.6834667Z W0507 20:31:50.676000 275344 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/2] ^ 2025-05-07T20:31:50.6835672Z W0507 20:31:50.676000 275344 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/2] ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:31:51.5866636Z 2025-05-07T20:31:51.5867392Z Trying example: test_silu_mul_quant( 2025-05-07T20:31:51.5867880Z self=, 2025-05-07T20:31:51.5868301Z T=1, 2025-05-07T20:31:51.5868488Z D=7168, 2025-05-07T20:31:51.5868688Z scale_ub=None, 2025-05-07T20:31:51.5868907Z contiguous=True, 2025-05-07T20:31:51.5869129Z compiled=True, 2025-05-07T20:31:51.5869341Z ) 2025-05-07T20:31:51.5869663Z self = 2025-05-07T20:31:51.5870144Z T = 1, D = 7168, scale_ub = None, contiguous = True, compiled = True 2025-05-07T20:31:51.5870408Z 2025-05-07T20:31:51.5870489Z @given( 2025-05-07T20:31:51.5870755Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:31:51.5871069Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:31:51.5871382Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:31:51.5871718Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:31:51.5872063Z compiled=st.sampled_from([True, False]), 2025-05-07T20:31:51.5872344Z ) 2025-05-07T20:31:51.5872722Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:31:51.5873189Z def test_silu_mul_quant( 2025-05-07T20:31:51.5873434Z self, 2025-05-07T20:31:51.5873630Z T: int, 2025-05-07T20:31:51.5873859Z D: int, 2025-05-07T20:31:51.5874069Z scale_ub: Optional[float], 2025-05-07T20:31:51.5874336Z contiguous: bool, 2025-05-07T20:31:51.5874577Z compiled: bool, 2025-05-07T20:31:51.5874801Z ) -> None: 2025-05-07T20:31:51.5875021Z torch.manual_seed(2025) 2025-05-07T20:31:51.5875262Z 2025-05-07T20:31:51.5875529Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:31:51.5875869Z 2025-05-07T20:31:51.5876057Z x_sign = torch.sign(x) 2025-05-07T20:31:51.5876337Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:31:51.5876647Z x = x_sign * x_clamp 2025-05-07T20:31:51.5876886Z x0 = x[:, :D] 2025-05-07T20:31:51.5877096Z x1 = x[:, D:] 2025-05-07T20:31:51.5877299Z 2025-05-07T20:31:51.5877484Z if contiguous: 2025-05-07T20:31:51.5877706Z x0 = x0.contiguous() 2025-05-07T20:31:51.5877961Z x1 = x1.contiguous() 2025-05-07T20:31:51.5878198Z 2025-05-07T20:31:51.5878382Z if scale_ub is not None: 2025-05-07T20:31:51.5878655Z scale_ub_tensor = torch.tensor( 2025-05-07T20:31:51.5878985Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:31:51.5879288Z ) 2025-05-07T20:31:51.5879473Z else: 2025-05-07T20:31:51.5879683Z scale_ub_tensor = None 2025-05-07T20:31:51.5879933Z 2025-05-07T20:31:51.5880155Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:31:51.5880466Z op = silu_mul_quant 2025-05-07T20:31:51.5880712Z if compiled: 2025-05-07T20:31:51.5881239Z op = torch.compile(op) 2025-05-07T20:31:51.5881533Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:31:51.5881808Z 2025-05-07T20:31:51.5881992Z y_fp8, y_scale = fn() 2025-05-07T20:31:51.5882270Z y = y_fp8.to(torch.float32) * y_scale[:, None] 2025-05-07T20:31:51.5882558Z 2025-05-07T20:31:51.5882785Z def ref_fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:31:51.5883119Z x0_fp32 = x0.to(torch.float32) 2025-05-07T20:31:51.5883408Z x1_fp32 = x1.to(torch.float32) 2025-05-07T20:31:51.5883717Z y = x0_fp32 * torch.sigmoid(x0_fp32) * x1_fp32 2025-05-07T20:31:51.5884067Z return triton_quantize_fp8_row(y, scale_ub_tensor) 2025-05-07T20:31:51.5884377Z 2025-05-07T20:31:51.5884728Z > y_fp8_ref, y_scale_ref = ref_fn() 2025-05-07T20:31:51.5884921Z 2025-05-07T20:31:51.5885023Z moe/activation_test.py:126: 2025-05-07T20:31:51.5885321Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:31:51.5885664Z moe/activation_test.py:124: in ref_fn 2025-05-07T20:31:51.5885983Z return triton_quantize_fp8_row(y, scale_ub_tensor) 2025-05-07T20:31:51.5886767Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/fbgemm_gpu/experimental/gemm/triton_gemm/fp8_gemm.py:2370: in triton_quantize_fp8_row 2025-05-07T20:31:51.5887509Z _kernel_quantize_fp8_row[grid]( 2025-05-07T20:31:51.5888051Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:31:51.5888719Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:31:51.5889402Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/autotuner.py:186: in run 2025-05-07T20:31:51.5890120Z timings = {config: self._bench(*args, config=config, **kwargs) for config in pruned_configs} 2025-05-07T20:31:51.5890834Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/autotuner.py:166: in _bench 2025-05-07T20:31:51.5891459Z return self.do_bench(kernel_call, quantiles=(0.5, 0.2, 0.8)) 2025-05-07T20:31:51.5892057Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/testing.py:117: in do_bench 2025-05-07T20:31:51.5892704Z fn() 2025-05-07T20:31:51.5893206Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/autotuner.py:152: in kernel_call 2025-05-07T20:31:51.5893962Z self.fn.run( 2025-05-07T20:31:51.5894483Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:31:51.5895014Z kernel = self.compile( 2025-05-07T20:31:51.5895551Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:31:51.5896214Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:31:51.5896609Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:31:51.5896844Z 2025-05-07T20:31:51.5897054Z self = 2025-05-07T20:31:51.5898123Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=2, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:31:51.5899976Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f93a73cf920>} 2025-05-07T20:31:51.5901357Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:31:51.5902369Z context = 2025-05-07T20:31:51.5902856Z 2025-05-07T20:31:51.5903021Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:31:51.5903536Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:31:51.5904001Z module_map=module_map) 2025-05-07T20:31:51.5904372Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:31:51.5904719Z E def _kernel_quantize_fp8_row( 2025-05-07T20:31:51.5904982Z E ^ 2025-05-07T20:31:51.5905429Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:31:51.5905864Z 2025-05-07T20:31:51.5906388Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:31:51.5906898Z 2025-05-07T20:31:51.5906999Z Trying example: test_silu_mul_quant( 2025-05-07T20:31:51.5907403Z self=, 2025-05-07T20:31:51.5907805Z T=4096, 2025-05-07T20:31:51.5907986Z D=5120, 2025-05-07T20:31:51.5908177Z scale_ub=None, 2025-05-07T20:31:51.5908393Z contiguous=False, 2025-05-07T20:31:51.5908610Z compiled=False, 2025-05-07T20:31:51.5908814Z ) 2025-05-07T20:31:51.5909126Z self = 2025-05-07T20:31:51.5909607Z T = 4096, D = 5120, scale_ub = None, contiguous = False, compiled = False 2025-05-07T20:31:51.5909885Z 2025-05-07T20:31:51.5909962Z @given( 2025-05-07T20:31:51.5910191Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:31:51.5910502Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:31:51.5910800Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:31:51.5911129Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:31:51.5911453Z compiled=st.sampled_from([True, False]), 2025-05-07T20:31:51.5911731Z ) 2025-05-07T20:31:51.5912078Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:31:51.5912528Z def test_silu_mul_quant( 2025-05-07T20:31:51.5912759Z self, 2025-05-07T20:31:51.5912956Z T: int, 2025-05-07T20:31:51.5913154Z D: int, 2025-05-07T20:31:51.5913364Z scale_ub: Optional[float], 2025-05-07T20:31:51.5913638Z contiguous: bool, 2025-05-07T20:31:51.5913877Z compiled: bool, 2025-05-07T20:31:51.5914092Z ) -> None: 2025-05-07T20:31:51.5914306Z torch.manual_seed(2025) 2025-05-07T20:31:51.5914544Z 2025-05-07T20:31:51.5914804Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:31:51.5915146Z 2025-05-07T20:31:51.5915339Z x_sign = torch.sign(x) 2025-05-07T20:31:51.5915629Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:31:51.5915929Z x = x_sign * x_clamp 2025-05-07T20:31:51.5916160Z x0 = x[:, :D] 2025-05-07T20:31:51.5916370Z x1 = x[:, D:] 2025-05-07T20:31:51.5916572Z 2025-05-07T20:31:51.5916753Z if contiguous: 2025-05-07T20:31:51.5916977Z x0 = x0.contiguous() 2025-05-07T20:31:51.5917227Z x1 = x1.contiguous() 2025-05-07T20:31:51.5917463Z 2025-05-07T20:31:51.5917652Z if scale_ub is not None: 2025-05-07T20:31:51.5917916Z scale_ub_tensor = torch.tensor( 2025-05-07T20:31:51.5918249Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:31:51.5918556Z ) 2025-05-07T20:31:51.5918739Z else: 2025-05-07T20:31:51.5918949Z scale_ub_tensor = None 2025-05-07T20:31:51.5919201Z 2025-05-07T20:31:51.5919422Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:31:51.5919737Z op = silu_mul_quant 2025-05-07T20:31:51.5919986Z if compiled: 2025-05-07T20:31:51.5920231Z op = torch.compile(op) 2025-05-07T20:31:51.5920523Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:31:51.5920799Z 2025-05-07T20:31:51.5921084Z > y_fp8, y_scale = fn() 2025-05-07T20:31:51.5921244Z 2025-05-07T20:31:51.5921340Z moe/activation_test.py:117: 2025-05-07T20:31:51.5921632Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:31:51.5921964Z moe/activation_test.py:115: in fn 2025-05-07T20:31:51.5922238Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:31:51.5922923Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:31:51.5923650Z 
_fbgemm_silu_mul_quant[grid]( 2025-05-07T20:31:51.5924186Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:31:51.5924934Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:31:51.5925597Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:31:51.5926135Z kernel = self.compile( 2025-05-07T20:31:51.5926662Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:31:51.5927312Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:31:51.5927708Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:31:51.5927936Z 2025-05-07T20:31:51.5928152Z self = 2025-05-07T20:31:51.5929218Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:31:51.5930570Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f93a7554cc0>} 2025-05-07T20:31:51.5931900Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:31:51.5932909Z context = 2025-05-07T20:31:51.5933191Z 2025-05-07T20:31:51.5933366Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:31:51.5933963Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:31:51.5934422Z module_map=module_map) 2025-05-07T20:31:51.5934783Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:31:51.5935130Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:31:51.5935388Z E ^ 2025-05-07T20:31:51.5935841Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:31:51.5936283Z 2025-05-07T20:31:51.5936692Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:31:51.8661142Z W0507 20:31:51.862000 275344 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/3] Encountered an exception in identify_mutated_tensors, assuming every input is mutated 2025-05-07T20:31:51.8662233Z W0507 20:31:51.862000 275344 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/3] Traceback (most recent call last): 2025-05-07T20:31:51.8663653Z W0507 20:31:51.862000 275344 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/3] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/torch/_higher_order_ops/triton_kernel_wrap.py", line 731, in identify_mutated_tensors 2025-05-07T20:31:51.8665169Z W0507 20:31:51.862000 275344 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/3] ttir_module, ordered_tensor_names = generate_ttir(kernel, kwargs) 2025-05-07T20:31:51.8666409Z W0507 20:31:51.862000 275344 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/3] ~~~~~~~~~~~~~^^^^^^^^^^^^^^^^ 2025-05-07T20:31:51.8667692Z W0507 20:31:51.862000 275344 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/3] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/torch/_higher_order_ops/triton_kernel_wrap.py", line 356, in generate_ttir 2025-05-07T20:31:51.8669050Z W0507 20:31:51.862000 275344 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/3] ttir_module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:31:51.8670479Z W0507 20:31:51.862000 275344 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/3] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py", line 100, in make_ir 2025-05-07T20:31:51.8671828Z W0507 20:31:51.862000 275344 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/3] return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:31:51.8672857Z W0507 20:31:51.862000 275344 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/3] module_map=module_map) 2025-05-07T20:31:51.8674092Z W0507 20:31:51.862000 275344 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/3] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/code_generator.py", line 1298, in ast_to_ttir 2025-05-07T20:31:51.8675319Z W0507 20:31:51.862000 275344 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/3] generator.visit(fn.parse()) 2025-05-07T20:31:51.8676159Z W0507 20:31:51.862000 275344 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/3] ~~~~~~~~~~~~~~~^^^^^^^^^^^^ 2025-05-07T20:31:51.8677338Z W0507 20:31:51.862000 275344 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/3] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/code_generator.py", line 1201, in visit 2025-05-07T20:31:51.8678529Z W0507 20:31:51.862000 275344 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/3] ret = super().visit(node) 2025-05-07T20:31:51.8679548Z W0507 20:31:51.862000 275344 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/3] File 
"/home/ec2-user/miniconda/envs/build_binary/lib/python3.13/ast.py", line 428, in visit 2025-05-07T20:31:51.8680550Z W0507 20:31:51.862000 275344 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/3] return visitor(node) 2025-05-07T20:31:51.8681756Z W0507 20:31:51.862000 275344 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/3] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/code_generator.py", line 352, in visit_Module 2025-05-07T20:31:51.8683065Z W0507 20:31:51.862000 275344 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/3] ast.NodeVisitor.generic_visit(self, node) 2025-05-07T20:31:51.8683955Z W0507 20:31:51.862000 275344 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/3] ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~^^^^^^^^^^^^ 2025-05-07T20:31:51.8685023Z W0507 20:31:51.862000 275344 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/3] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.13/ast.py", line 436, in generic_visit 2025-05-07T20:31:51.8686047Z W0507 20:31:51.862000 275344 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/3] self.visit(item) 2025-05-07T20:31:51.8686813Z W0507 20:31:51.862000 275344 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/3] ~~~~~~~~~~^^^^^^ 2025-05-07T20:31:51.8687955Z W0507 20:31:51.862000 275344 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/3] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/code_generator.py", line 1207, in visit 2025-05-07T20:31:51.8689372Z W0507 20:31:51.862000 275344 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/3] raise CompilationError(self.jit_fn.src, self.cur_node, repr(e)) from None 2025-05-07T20:31:51.8690417Z W0507 20:31:51.862000 275344 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/3] triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:31:51.8691319Z W0507 20:31:51.862000 275344 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/3] def _fbgemm_silu_mul_quant( 2025-05-07T20:31:51.8692054Z W0507 20:31:51.862000 275344 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/3] ^ 2025-05-07T20:31:51.8693128Z W0507 20:31:51.862000 275344 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/3] ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:31:53.4833144Z 2025-05-07T20:31:53.4834186Z Trying example: test_silu_mul_quant( 2025-05-07T20:31:53.4835130Z self=, 2025-05-07T20:31:53.4835950Z T=4096, 2025-05-07T20:31:53.4836331Z D=7168, 2025-05-07T20:31:53.4836702Z scale_ub=None, 2025-05-07T20:31:53.4837127Z contiguous=False, 2025-05-07T20:31:53.4837603Z compiled=False, 2025-05-07T20:31:53.4838001Z ) 2025-05-07T20:31:53.4838628Z self = 2025-05-07T20:31:53.4839605Z T = 4096, D = 7168, scale_ub = None, contiguous = False, compiled = False 2025-05-07T20:31:53.4840157Z 2025-05-07T20:31:53.4840323Z @given( 2025-05-07T20:31:53.4840772Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:31:53.4841397Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:31:53.4841996Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:31:53.4842631Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:31:53.4843271Z compiled=st.sampled_from([True, False]), 2025-05-07T20:31:53.4843761Z ) 2025-05-07T20:31:53.4844151Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:31:53.4844593Z def test_silu_mul_quant( 2025-05-07T20:31:53.4844839Z self, 2025-05-07T20:31:53.4845034Z T: int, 2025-05-07T20:31:53.4845234Z D: int, 2025-05-07T20:31:53.4845453Z scale_ub: Optional[float], 2025-05-07T20:31:53.4845718Z contiguous: bool, 2025-05-07T20:31:53.4845962Z compiled: bool, 2025-05-07T20:31:53.4846604Z ) -> None: 2025-05-07T20:31:53.4846838Z torch.manual_seed(2025) 2025-05-07T20:31:53.4847078Z 2025-05-07T20:31:53.4847351Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:31:53.4847702Z 2025-05-07T20:31:53.4847892Z x_sign = torch.sign(x) 2025-05-07T20:31:53.4848184Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:31:53.4848495Z x = x_sign * x_clamp 2025-05-07T20:31:53.4848738Z x0 = x[:, :D] 2025-05-07T20:31:53.4848950Z x1 = x[:, D:] 2025-05-07T20:31:53.4849157Z 2025-05-07T20:31:53.4849343Z if contiguous: 2025-05-07T20:31:53.4849564Z x0 = x0.contiguous() 2025-05-07T20:31:53.4849822Z x1 = x1.contiguous() 2025-05-07T20:31:53.4858411Z 2025-05-07T20:31:53.4858650Z if scale_ub is not None: 2025-05-07T20:31:53.4858944Z scale_ub_tensor = torch.tensor( 2025-05-07T20:31:53.4859296Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:31:53.4859608Z ) 2025-05-07T20:31:53.4859808Z else: 2025-05-07T20:31:53.4860025Z scale_ub_tensor = None 2025-05-07T20:31:53.4860273Z 2025-05-07T20:31:53.4860511Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:31:53.4860833Z op = silu_mul_quant 2025-05-07T20:31:53.4861082Z if compiled: 2025-05-07T20:31:53.4861334Z op = torch.compile(op) 2025-05-07T20:31:53.4861632Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:31:53.4861910Z 2025-05-07T20:31:53.4862100Z > y_fp8, y_scale = fn() 2025-05-07T20:31:53.4862271Z 2025-05-07T20:31:53.4862373Z moe/activation_test.py:117: 2025-05-07T20:31:53.4862683Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:31:53.4863018Z moe/activation_test.py:115: in fn 2025-05-07T20:31:53.4863304Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:31:53.4863997Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:31:53.4864685Z _fbgemm_silu_mul_quant[grid]( 2025-05-07T20:31:53.4865221Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/jit.py:330: in 
2025-05-07T20:31:53.4865901Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:31:53.4866587Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:31:53.4867125Z kernel = self.compile( 2025-05-07T20:31:53.4867659Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:31:53.4868321Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:31:53.4868720Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:31:53.4868948Z 2025-05-07T20:31:53.4869165Z self = 2025-05-07T20:31:53.4870234Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:31:53.4871602Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f93a75568e0>} 2025-05-07T20:31:53.4872933Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:31:53.4873948Z context = 2025-05-07T20:31:53.4874231Z 2025-05-07T20:31:53.4874395Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:31:53.4875007Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:31:53.4875475Z module_map=module_map) 2025-05-07T20:31:53.4875843Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:31:53.4876187Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:31:53.4876450Z E ^ 2025-05-07T20:31:53.4876910Z E ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:31:53.4877352Z 2025-05-07T20:31:53.4877767Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:31:53.4878273Z 2025-05-07T20:31:53.4878456Z Trying example: test_silu_mul_quant( 2025-05-07T20:31:53.4878868Z self=, 2025-05-07T20:31:53.4879267Z T=128, 2025-05-07T20:31:53.4879454Z D=7168, 2025-05-07T20:31:53.4879655Z scale_ub=None, 2025-05-07T20:31:53.4879869Z contiguous=False, 2025-05-07T20:31:53.4880095Z compiled=True, 2025-05-07T20:31:53.4880298Z ) 2025-05-07T20:31:53.4880606Z self = 2025-05-07T20:31:53.4881094Z T = 128, D = 7168, scale_ub = None, contiguous = False, compiled = True 2025-05-07T20:31:53.4881358Z 2025-05-07T20:31:53.4881445Z @given( 2025-05-07T20:31:53.4881670Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:31:53.4881986Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:31:53.4882295Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:31:53.4882618Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:31:53.4882958Z compiled=st.sampled_from([True, False]), 2025-05-07T20:31:53.4883257Z ) 2025-05-07T20:31:53.4883645Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:31:53.4884083Z def test_silu_mul_quant( 2025-05-07T20:31:53.4884329Z self, 2025-05-07T20:31:53.4884528Z T: int, 2025-05-07T20:31:53.4884716Z D: int, 2025-05-07T20:31:53.4884935Z scale_ub: Optional[float], 2025-05-07T20:31:53.4885203Z contiguous: bool, 2025-05-07T20:31:53.4885435Z compiled: bool, 2025-05-07T20:31:53.4885661Z ) -> None: 2025-05-07T20:31:53.4885874Z torch.manual_seed(2025) 2025-05-07T20:31:53.4886108Z 2025-05-07T20:31:53.4886378Z x = 
torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:31:53.4886718Z 2025-05-07T20:31:53.4886906Z x_sign = torch.sign(x) 2025-05-07T20:31:53.4887195Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:31:53.4887508Z x = x_sign * x_clamp 2025-05-07T20:31:53.4887737Z x0 = x[:, :D] 2025-05-07T20:31:53.4887954Z x1 = x[:, D:] 2025-05-07T20:31:53.4888163Z 2025-05-07T20:31:53.4888348Z if contiguous: 2025-05-07T20:31:53.4888577Z x0 = x0.contiguous() 2025-05-07T20:31:53.4888835Z x1 = x1.contiguous() 2025-05-07T20:31:53.4889075Z 2025-05-07T20:31:53.4889261Z if scale_ub is not None: 2025-05-07T20:31:53.4889539Z scale_ub_tensor = torch.tensor( 2025-05-07T20:31:53.4889870Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:31:53.4890177Z ) 2025-05-07T20:31:53.4890376Z else: 2025-05-07T20:31:53.4890588Z scale_ub_tensor = None 2025-05-07T20:31:53.4890840Z 2025-05-07T20:31:53.4891069Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:31:53.4891381Z op = silu_mul_quant 2025-05-07T20:31:53.4891625Z if compiled: 2025-05-07T20:31:53.4891877Z op = torch.compile(op) 2025-05-07T20:31:53.4892173Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:31:53.4892444Z 2025-05-07T20:31:53.4892640Z y_fp8, y_scale = fn() 2025-05-07T20:31:53.4892923Z y = y_fp8.to(torch.float32) * y_scale[:, None] 2025-05-07T20:31:53.4893332Z 2025-05-07T20:31:53.4893590Z def ref_fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:31:53.4894040Z x0_fp32 = x0.to(torch.float32) 2025-05-07T20:31:53.4894333Z x1_fp32 = x1.to(torch.float32) 2025-05-07T20:31:53.4894639Z y = x0_fp32 * torch.sigmoid(x0_fp32) * x1_fp32 2025-05-07T20:31:53.4895003Z return triton_quantize_fp8_row(y, scale_ub_tensor) 2025-05-07T20:31:53.4895315Z 2025-05-07T20:31:53.4895510Z > y_fp8_ref, y_scale_ref = ref_fn() 2025-05-07T20:31:53.4895707Z 2025-05-07T20:31:53.4895807Z moe/activation_test.py:126: 2025-05-07T20:31:53.4896102Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:31:53.4896521Z moe/activation_test.py:124: in ref_fn 2025-05-07T20:31:53.4896842Z return triton_quantize_fp8_row(y, scale_ub_tensor) 2025-05-07T20:31:53.4897624Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/fbgemm_gpu/experimental/gemm/triton_gemm/fp8_gemm.py:2370: in triton_quantize_fp8_row 2025-05-07T20:31:53.4898637Z _kernel_quantize_fp8_row[grid]( 2025-05-07T20:31:53.4899175Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:31:53.4899851Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:31:53.4900533Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/autotuner.py:186: in run 2025-05-07T20:31:53.4901245Z timings = {config: self._bench(*args, config=config, **kwargs) for config in pruned_configs} 2025-05-07T20:31:53.4901964Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/autotuner.py:166: in _bench 2025-05-07T20:31:53.4902602Z return self.do_bench(kernel_call, quantiles=(0.5, 0.2, 0.8)) 2025-05-07T20:31:53.4903198Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/testing.py:117: in do_bench 2025-05-07T20:31:53.4903717Z fn() 2025-05-07T20:31:53.4904212Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/autotuner.py:152: in kernel_call 2025-05-07T20:31:53.4904784Z self.fn.run( 2025-05-07T20:31:53.4905246Z 
/home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:31:53.4905765Z kernel = self.compile( 2025-05-07T20:31:53.4906302Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:31:53.4906946Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:31:53.4907340Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:31:53.4907572Z 2025-05-07T20:31:53.4907779Z self = 2025-05-07T20:31:53.4908853Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=2, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:31:53.4910206Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f93a75558a0>} 2025-05-07T20:31:53.4911526Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:31:53.4912527Z context = 2025-05-07T20:31:53.4912811Z 2025-05-07T20:31:53.4912981Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:31:53.4913495Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:31:53.4914105Z module_map=module_map) 2025-05-07T20:31:53.4914464Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:31:53.4914819Z E def _kernel_quantize_fp8_row( 2025-05-07T20:31:53.4915083Z E ^ 2025-05-07T20:31:53.4915533Z E ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:31:53.4915975Z 2025-05-07T20:31:53.4916385Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:31:53.7294045Z 2025-05-07T20:31:53.7294319Z Trying example: test_silu_mul_quant( 2025-05-07T20:31:53.7294758Z self=, 2025-05-07T20:31:53.7295397Z T=128, 2025-05-07T20:31:53.7295605Z D=7168, 2025-05-07T20:31:53.7295802Z scale_ub=None, 2025-05-07T20:31:53.7296031Z contiguous=False, 2025-05-07T20:31:53.7296266Z compiled=False, 2025-05-07T20:31:53.7296479Z ) 2025-05-07T20:31:53.7296806Z self = 2025-05-07T20:31:53.7297302Z T = 128, D = 7168, scale_ub = None, contiguous = False, compiled = False 2025-05-07T20:31:53.7297569Z 2025-05-07T20:31:53.7297660Z @given( 2025-05-07T20:31:53.7297893Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:31:53.7298368Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:31:53.7298683Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:31:53.7299009Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:31:53.7299341Z compiled=st.sampled_from([True, False]), 2025-05-07T20:31:53.7299632Z ) 2025-05-07T20:31:53.7299982Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:31:53.7300426Z def test_silu_mul_quant( 2025-05-07T20:31:53.7300670Z self, 2025-05-07T20:31:53.7300865Z T: int, 2025-05-07T20:31:53.7301067Z D: int, 2025-05-07T20:31:53.7301296Z scale_ub: Optional[float], 2025-05-07T20:31:53.7301564Z contiguous: bool, 2025-05-07T20:31:53.7301811Z compiled: bool, 2025-05-07T20:31:53.7302045Z ) -> None: 2025-05-07T20:31:53.7302268Z torch.manual_seed(2025) 2025-05-07T20:31:53.7302506Z 2025-05-07T20:31:53.7302782Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:31:53.7303129Z 2025-05-07T20:31:53.7303323Z x_sign = torch.sign(x) 
2025-05-07T20:31:53.7303618Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:31:53.7303931Z x = x_sign * x_clamp 2025-05-07T20:31:53.7304169Z x0 = x[:, :D] 2025-05-07T20:31:53.7304390Z x1 = x[:, D:] 2025-05-07T20:31:53.7304600Z 2025-05-07T20:31:53.7304795Z if contiguous: 2025-05-07T20:31:53.7305031Z x0 = x0.contiguous() 2025-05-07T20:31:53.7305296Z x1 = x1.contiguous() 2025-05-07T20:31:53.7305537Z 2025-05-07T20:31:53.7305745Z if scale_ub is not None: 2025-05-07T20:31:53.7306024Z scale_ub_tensor = torch.tensor( 2025-05-07T20:31:53.7306356Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:31:53.7306669Z ) 2025-05-07T20:31:53.7306870Z else: 2025-05-07T20:31:53.7307089Z scale_ub_tensor = None 2025-05-07T20:31:53.7307339Z 2025-05-07T20:31:53.7307573Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:31:53.7307894Z op = silu_mul_quant 2025-05-07T20:31:53.7308147Z if compiled: 2025-05-07T20:31:53.7308404Z op = torch.compile(op) 2025-05-07T20:31:53.7308706Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:31:53.7308979Z 2025-05-07T20:31:53.7309184Z > y_fp8, y_scale = fn() 2025-05-07T20:31:53.7309349Z 2025-05-07T20:31:53.7309456Z moe/activation_test.py:117: 2025-05-07T20:31:53.7309749Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:31:53.7310272Z moe/activation_test.py:115: in fn 2025-05-07T20:31:53.7310560Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:31:53.7311248Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:31:53.7311929Z _fbgemm_silu_mul_quant[grid]( 2025-05-07T20:31:53.7312466Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:31:53.7313150Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:31:53.7313854Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:31:53.7314542Z kernel = self.compile( 2025-05-07T20:31:53.7315085Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:31:53.7315738Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:31:53.7316140Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:31:53.7316372Z 2025-05-07T20:31:53.7316578Z self = 2025-05-07T20:31:53.7317642Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:31:53.7318991Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . 
at 0x7f93a659e700>} 2025-05-07T20:31:53.7320315Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:31:53.7321326Z context = 2025-05-07T20:31:53.7321619Z 2025-05-07T20:31:53.7321785Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:31:53.7322305Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:31:53.7322769Z module_map=module_map) 2025-05-07T20:31:53.7323138Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:31:53.7323525Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:31:53.7323809Z E ^ 2025-05-07T20:31:53.7324268Z E ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:31:53.7324718Z 2025-05-07T20:31:53.7325135Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:31:53.7325642Z 2025-05-07T20:31:53.7325756Z Trying example: test_silu_mul_quant( 2025-05-07T20:31:53.7326166Z self=, 2025-05-07T20:31:53.7326568Z T=4096, 2025-05-07T20:31:53.7326764Z D=5120, 2025-05-07T20:31:53.7326959Z scale_ub=1200.0, 2025-05-07T20:31:53.7327186Z contiguous=True, 2025-05-07T20:31:53.7327416Z compiled=False, 2025-05-07T20:31:53.7327626Z ) 2025-05-07T20:31:53.7327942Z self = 2025-05-07T20:31:53.7328445Z T = 4096, D = 5120, scale_ub = 1200.0, contiguous = True, compiled = False 2025-05-07T20:31:53.7328715Z 2025-05-07T20:31:53.7328801Z @given( 2025-05-07T20:31:53.7329029Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:31:53.7329352Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:31:53.7329662Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:31:53.7329986Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:31:53.7330315Z compiled=st.sampled_from([True, False]), 2025-05-07T20:31:53.7330695Z ) 2025-05-07T20:31:53.7331039Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:31:53.7331485Z def test_silu_mul_quant( 2025-05-07T20:31:53.7331730Z self, 2025-05-07T20:31:53.7331929Z T: int, 2025-05-07T20:31:53.7332124Z D: int, 2025-05-07T20:31:53.7332353Z scale_ub: Optional[float], 2025-05-07T20:31:53.7332627Z contiguous: bool, 2025-05-07T20:31:53.7332865Z compiled: bool, 2025-05-07T20:31:53.7333090Z ) -> None: 2025-05-07T20:31:53.7333309Z torch.manual_seed(2025) 2025-05-07T20:31:53.7333546Z 2025-05-07T20:31:53.7333880Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:31:53.7334302Z 2025-05-07T20:31:53.7334498Z x_sign = torch.sign(x) 2025-05-07T20:31:53.7334787Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:31:53.7335098Z x = x_sign * x_clamp 2025-05-07T20:31:53.7335341Z x0 = x[:, :D] 2025-05-07T20:31:53.7335563Z x1 = x[:, D:] 2025-05-07T20:31:53.7335774Z 2025-05-07T20:31:53.7335956Z if contiguous: 2025-05-07T20:31:53.7336190Z x0 = x0.contiguous() 2025-05-07T20:31:53.7336453Z x1 = x1.contiguous() 2025-05-07T20:31:53.7336696Z 2025-05-07T20:31:53.7336886Z if scale_ub is not None: 2025-05-07T20:31:53.7337163Z scale_ub_tensor = torch.tensor( 2025-05-07T20:31:53.7337504Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:31:53.7337809Z ) 2025-05-07T20:31:53.7338011Z else: 2025-05-07T20:31:53.7338225Z scale_ub_tensor = None 2025-05-07T20:31:53.7338474Z 2025-05-07T20:31:53.7338718Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:31:53.7339041Z op = silu_mul_quant 2025-05-07T20:31:53.7339287Z if compiled: 
2025-05-07T20:31:53.7339538Z op = torch.compile(op) 2025-05-07T20:31:53.7339846Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:31:53.7340120Z 2025-05-07T20:31:53.7340318Z > y_fp8, y_scale = fn() 2025-05-07T20:31:53.7340481Z 2025-05-07T20:31:53.7340588Z moe/activation_test.py:117: 2025-05-07T20:31:53.7340888Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:31:53.7341218Z moe/activation_test.py:115: in fn 2025-05-07T20:31:53.7341502Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:31:53.7342190Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:31:53.7342875Z _fbgemm_silu_mul_quant[grid]( 2025-05-07T20:31:53.7343419Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:31:53.7344103Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:31:53.7344769Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:31:53.7345303Z kernel = self.compile( 2025-05-07T20:31:53.7345844Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:31:53.7346501Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:31:53.7346895Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:31:53.7347131Z 2025-05-07T20:31:53.7347339Z self = 2025-05-07T20:31:53.7348412Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:31:53.7349767Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f93a659c360>} 2025-05-07T20:31:53.7351177Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:31:53.7352176Z context = 2025-05-07T20:31:53.7352467Z 2025-05-07T20:31:53.7352633Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:31:53.7353151Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:31:53.7353665Z module_map=module_map) 2025-05-07T20:31:53.7354097Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:31:53.7354452Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:31:53.7354713Z E ^ 2025-05-07T20:31:53.7355170Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:31:53.7355623Z 2025-05-07T20:31:53.7356032Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:31:53.7356545Z 2025-05-07T20:31:53.7356649Z Trying example: test_silu_mul_quant( 2025-05-07T20:31:53.7357059Z self=, 2025-05-07T20:31:53.7357455Z T=1, 2025-05-07T20:31:53.7357644Z D=5120, 2025-05-07T20:31:53.7357841Z scale_ub=None, 2025-05-07T20:31:53.7358055Z contiguous=True, 2025-05-07T20:31:53.7358284Z compiled=True, 2025-05-07T20:31:53.7358488Z ) 2025-05-07T20:31:53.7358805Z self = 2025-05-07T20:31:53.7359287Z T = 1, D = 5120, scale_ub = None, contiguous = True, compiled = True 2025-05-07T20:31:53.7359543Z 2025-05-07T20:31:53.7359628Z @given( 2025-05-07T20:31:53.7359856Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:31:53.7360178Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:31:53.7360489Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:31:53.7360822Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:31:53.7361143Z compiled=st.sampled_from([True, False]), 2025-05-07T20:31:53.7361432Z ) 2025-05-07T20:31:53.7361782Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:31:53.7362220Z def test_silu_mul_quant( 2025-05-07T20:31:53.7362465Z self, 2025-05-07T20:31:53.7362663Z T: int, 2025-05-07T20:31:53.7362858Z D: int, 2025-05-07T20:31:53.7363084Z scale_ub: Optional[float], 2025-05-07T20:31:53.7363391Z contiguous: bool, 2025-05-07T20:31:53.7363653Z compiled: bool, 2025-05-07T20:31:53.7363878Z ) -> None: 2025-05-07T20:31:53.7364101Z torch.manual_seed(2025) 2025-05-07T20:31:53.7364340Z 2025-05-07T20:31:53.7364619Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:31:53.7364964Z 2025-05-07T20:31:53.7365164Z x_sign = torch.sign(x) 2025-05-07T20:31:53.7365453Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:31:53.7367243Z x = x_sign * x_clamp 2025-05-07T20:31:53.7367484Z x0 = x[:, :D] 2025-05-07T20:31:53.7367701Z x1 = x[:, D:] 2025-05-07T20:31:53.7367912Z 2025-05-07T20:31:53.7368103Z if contiguous: 2025-05-07T20:31:53.7368330Z x0 = x0.contiguous() 2025-05-07T20:31:53.7368592Z x1 = x1.contiguous() 2025-05-07T20:31:53.7368838Z 2025-05-07T20:31:53.7369026Z if scale_ub is not None: 2025-05-07T20:31:53.7369312Z scale_ub_tensor = torch.tensor( 2025-05-07T20:31:53.7369656Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:31:53.7369964Z ) 2025-05-07T20:31:53.7370161Z else: 2025-05-07T20:31:53.7370376Z scale_ub_tensor = None 2025-05-07T20:31:53.7370710Z 2025-05-07T20:31:53.7370942Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:31:53.7371262Z op = silu_mul_quant 2025-05-07T20:31:53.7371518Z if compiled: 2025-05-07T20:31:53.7371761Z op = torch.compile(op) 2025-05-07T20:31:53.7372056Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:31:53.7372334Z 2025-05-07T20:31:53.7372522Z y_fp8, y_scale = fn() 2025-05-07T20:31:53.7372829Z y = y_fp8.to(torch.float32) * y_scale[:, None] 2025-05-07T20:31:53.7373120Z 2025-05-07T20:31:53.7373355Z def ref_fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:31:53.7373765Z x0_fp32 = x0.to(torch.float32) 2025-05-07T20:31:53.7374161Z x1_fp32 = x1.to(torch.float32) 2025-05-07T20:31:53.7374479Z y = x0_fp32 * torch.sigmoid(x0_fp32) * x1_fp32 2025-05-07T20:31:53.7374833Z return triton_quantize_fp8_row(y, scale_ub_tensor) 2025-05-07T20:31:53.7375150Z 2025-05-07T20:31:53.7375351Z > y_fp8_ref, 
y_scale_ref = ref_fn() 2025-05-07T20:31:53.7375542Z 2025-05-07T20:31:53.7375643Z moe/activation_test.py:126: 2025-05-07T20:31:53.7375944Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:31:53.7376278Z moe/activation_test.py:124: in ref_fn 2025-05-07T20:31:53.7376599Z return triton_quantize_fp8_row(y, scale_ub_tensor) 2025-05-07T20:31:53.7377378Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/fbgemm_gpu/experimental/gemm/triton_gemm/fp8_gemm.py:2370: in triton_quantize_fp8_row 2025-05-07T20:31:53.7378138Z _kernel_quantize_fp8_row[grid]( 2025-05-07T20:31:53.7378682Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:31:53.7379360Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:31:53.7380040Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/autotuner.py:186: in run 2025-05-07T20:31:53.7380754Z timings = {config: self._bench(*args, config=config, **kwargs) for config in pruned_configs} 2025-05-07T20:31:53.7381470Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/autotuner.py:166: in _bench 2025-05-07T20:31:53.7382104Z return self.do_bench(kernel_call, quantiles=(0.5, 0.2, 0.8)) 2025-05-07T20:31:53.7382702Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/testing.py:117: in do_bench 2025-05-07T20:31:53.7390263Z fn() 2025-05-07T20:31:53.7390810Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/autotuner.py:152: in kernel_call 2025-05-07T20:31:53.7391405Z self.fn.run( 2025-05-07T20:31:53.7391870Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:31:53.7392406Z kernel = self.compile( 2025-05-07T20:31:53.7392946Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:31:53.7393601Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:31:53.7393997Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:31:53.7394232Z 2025-05-07T20:31:53.7394439Z self = 2025-05-07T20:31:53.7395510Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=2, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:31:53.7396869Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f93a659ef20>} 2025-05-07T20:31:53.7398423Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:31:53.7399610Z context = 2025-05-07T20:31:53.7399896Z 2025-05-07T20:31:53.7400064Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:31:53.7400578Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:31:53.7401032Z module_map=module_map) 2025-05-07T20:31:53.7401396Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:31:53.7401748Z E def _kernel_quantize_fp8_row( 2025-05-07T20:31:53.7402011Z E ^ 2025-05-07T20:31:53.7402583Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:31:53.7403035Z 2025-05-07T20:31:53.7403445Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:31:53.9635724Z W0507 20:31:53.959000 275344 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/4] Encountered an exception in identify_mutated_tensors, assuming every input is mutated 2025-05-07T20:31:53.9636797Z W0507 20:31:53.959000 275344 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/4] Traceback (most recent call last): 2025-05-07T20:31:53.9638125Z W0507 20:31:53.959000 275344 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/4] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/torch/_higher_order_ops/triton_kernel_wrap.py", line 731, in identify_mutated_tensors 2025-05-07T20:31:53.9639535Z W0507 20:31:53.959000 275344 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/4] ttir_module, ordered_tensor_names = generate_ttir(kernel, kwargs) 2025-05-07T20:31:53.9640512Z W0507 20:31:53.959000 275344 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/4] ~~~~~~~~~~~~~^^^^^^^^^^^^^^^^ 2025-05-07T20:31:53.9641808Z W0507 20:31:53.959000 275344 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/4] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/torch/_higher_order_ops/triton_kernel_wrap.py", line 356, in generate_ttir 2025-05-07T20:31:53.9643162Z W0507 20:31:53.959000 275344 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/4] ttir_module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:31:53.9644500Z W0507 20:31:53.959000 275344 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/4] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py", line 100, in make_ir 2025-05-07T20:31:53.9645849Z W0507 20:31:53.959000 275344 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/4] return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:31:53.9646892Z W0507 20:31:53.959000 275344 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/4] module_map=module_map) 2025-05-07T20:31:53.9648140Z W0507 20:31:53.959000 275344 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/4] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/code_generator.py", line 1298, in ast_to_ttir 2025-05-07T20:31:53.9649367Z W0507 20:31:53.959000 275344 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/4] generator.visit(fn.parse()) 2025-05-07T20:31:53.9650216Z W0507 20:31:53.959000 275344 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/4] ~~~~~~~~~~~~~~~^^^^^^^^^^^^ 2025-05-07T20:31:53.9651398Z W0507 20:31:53.959000 275344 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/4] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/code_generator.py", line 1201, in visit 2025-05-07T20:31:53.9652745Z W0507 20:31:53.959000 275344 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/4] ret = super().visit(node) 2025-05-07T20:31:53.9653933Z W0507 20:31:53.959000 275344 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/4] File 
"/home/ec2-user/miniconda/envs/build_binary/lib/python3.13/ast.py", line 428, in visit 2025-05-07T20:31:53.9654933Z W0507 20:31:53.959000 275344 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/4] return visitor(node) 2025-05-07T20:31:53.9656253Z W0507 20:31:53.959000 275344 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/4] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/code_generator.py", line 352, in visit_Module 2025-05-07T20:31:53.9657515Z W0507 20:31:53.959000 275344 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/4] ast.NodeVisitor.generic_visit(self, node) 2025-05-07T20:31:53.9658408Z W0507 20:31:53.959000 275344 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/4] ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~^^^^^^^^^^^^ 2025-05-07T20:31:53.9659477Z W0507 20:31:53.959000 275344 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/4] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.13/ast.py", line 436, in generic_visit 2025-05-07T20:31:53.9660506Z W0507 20:31:53.959000 275344 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/4] self.visit(item) 2025-05-07T20:31:53.9661269Z W0507 20:31:53.959000 275344 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/4] ~~~~~~~~~~^^^^^^ 2025-05-07T20:31:53.9662422Z W0507 20:31:53.959000 275344 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/4] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/code_generator.py", line 1207, in visit 2025-05-07T20:31:53.9663761Z W0507 20:31:53.959000 275344 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/4] raise CompilationError(self.jit_fn.src, self.cur_node, repr(e)) from None 2025-05-07T20:31:53.9664809Z W0507 20:31:53.959000 275344 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/4] triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:31:53.9665712Z W0507 20:31:53.959000 275344 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/4] def _fbgemm_silu_mul_quant( 2025-05-07T20:31:53.9666456Z W0507 20:31:53.959000 275344 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/4] ^ 2025-05-07T20:31:53.9667465Z W0507 20:31:53.959000 275344 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/4] ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:31:54.0248353Z W0507 20:31:54.021000 275344 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/4] Encountered an exception in identify_mutated_tensors, assuming every input is mutated 2025-05-07T20:31:54.0249410Z W0507 20:31:54.021000 275344 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/4] Traceback (most recent call last): 2025-05-07T20:31:54.0250720Z W0507 20:31:54.021000 275344 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/4] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/torch/_higher_order_ops/triton_kernel_wrap.py", line 731, in identify_mutated_tensors 2025-05-07T20:31:54.0252112Z W0507 20:31:54.021000 275344 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/4] ttir_module, ordered_tensor_names = generate_ttir(kernel, kwargs) 2025-05-07T20:31:54.0253084Z W0507 20:31:54.021000 275344 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/4] ~~~~~~~~~~~~~^^^^^^^^^^^^^^^^ 2025-05-07T20:31:54.0254461Z W0507 20:31:54.021000 275344 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/4] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/torch/_higher_order_ops/triton_kernel_wrap.py", line 356, in generate_ttir 2025-05-07T20:31:54.0255964Z W0507 20:31:54.021000 275344 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/4] ttir_module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:31:54.0257248Z W0507 20:31:54.021000 275344 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/4] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py", line 100, in make_ir 2025-05-07T20:31:54.0258718Z W0507 20:31:54.021000 275344 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/4] return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:31:54.0259756Z W0507 20:31:54.021000 275344 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/4] module_map=module_map) 2025-05-07T20:31:54.0260997Z W0507 20:31:54.021000 275344 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/4] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/code_generator.py", line 1298, in ast_to_ttir 2025-05-07T20:31:54.0262222Z W0507 20:31:54.021000 275344 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/4] generator.visit(fn.parse()) 2025-05-07T20:31:54.0263051Z W0507 20:31:54.021000 275344 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/4] ~~~~~~~~~~~~~~~^^^^^^^^^^^^ 2025-05-07T20:31:54.0264236Z W0507 20:31:54.021000 275344 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/4] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/code_generator.py", line 1201, in visit 2025-05-07T20:31:54.0265418Z W0507 20:31:54.021000 275344 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/4] ret = super().visit(node) 2025-05-07T20:31:54.0266436Z W0507 20:31:54.021000 275344 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/4] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.13/ast.py", line 428, in visit 2025-05-07T20:31:54.0267434Z W0507 20:31:54.021000 275344 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/4] return 
visitor(node) 2025-05-07T20:31:54.0268627Z W0507 20:31:54.021000 275344 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/4] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/code_generator.py", line 352, in visit_Module 2025-05-07T20:31:54.0269882Z W0507 20:31:54.021000 275344 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/4] ast.NodeVisitor.generic_visit(self, node) 2025-05-07T20:31:54.0270763Z W0507 20:31:54.021000 275344 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/4] ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~^^^^^^^^^^^^ 2025-05-07T20:31:54.0271831Z W0507 20:31:54.021000 275344 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/4] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.13/ast.py", line 436, in generic_visit 2025-05-07T20:31:54.0272850Z W0507 20:31:54.021000 275344 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/4] self.visit(item) 2025-05-07T20:31:54.0273610Z W0507 20:31:54.021000 275344 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/4] ~~~~~~~~~~^^^^^^ 2025-05-07T20:31:54.0274761Z W0507 20:31:54.021000 275344 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/4] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/code_generator.py", line 1207, in visit 2025-05-07T20:31:54.0276081Z W0507 20:31:54.021000 275344 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/4] raise CompilationError(self.jit_fn.src, self.cur_node, repr(e)) from None 2025-05-07T20:31:54.0277210Z W0507 20:31:54.021000 275344 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/4] triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:31:54.0278102Z W0507 20:31:54.021000 275344 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/4] def _fbgemm_silu_mul_quant( 2025-05-07T20:31:54.0278831Z W0507 20:31:54.021000 275344 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/4] ^ 2025-05-07T20:31:54.0279830Z W0507 20:31:54.021000 275344 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/4] ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:31:54.2107701Z W0507 20:31:54.207000 275344 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/4] Encountered an exception in identify_mutated_tensors, assuming every input is mutated 2025-05-07T20:31:54.2108757Z W0507 20:31:54.207000 275344 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/4] Traceback (most recent call last): 2025-05-07T20:31:54.2110086Z W0507 20:31:54.207000 275344 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/4] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/torch/_higher_order_ops/triton_kernel_wrap.py", line 731, in identify_mutated_tensors 2025-05-07T20:31:54.2111484Z W0507 20:31:54.207000 275344 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/4] ttir_module, ordered_tensor_names = generate_ttir(kernel, kwargs) 2025-05-07T20:31:54.2112459Z W0507 20:31:54.207000 275344 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/4] ~~~~~~~~~~~~~^^^^^^^^^^^^^^^^ 2025-05-07T20:31:54.2113741Z W0507 20:31:54.207000 275344 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/4] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/torch/_higher_order_ops/triton_kernel_wrap.py", line 356, in generate_ttir 2025-05-07T20:31:54.2115102Z W0507 20:31:54.207000 275344 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/4] ttir_module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:31:54.2116380Z W0507 20:31:54.207000 275344 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/4] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py", line 100, in make_ir 2025-05-07T20:31:54.2117730Z W0507 20:31:54.207000 275344 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/4] return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:31:54.2118769Z W0507 20:31:54.207000 275344 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/4] module_map=module_map) 2025-05-07T20:31:54.2120003Z W0507 20:31:54.207000 275344 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/4] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/code_generator.py", line 1298, in ast_to_ttir 2025-05-07T20:31:54.2121231Z W0507 20:31:54.207000 275344 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/4] generator.visit(fn.parse()) 2025-05-07T20:31:54.2122073Z W0507 20:31:54.207000 275344 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/4] ~~~~~~~~~~~~~~~^^^^^^^^^^^^ 2025-05-07T20:31:54.2123258Z W0507 20:31:54.207000 275344 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/4] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/code_generator.py", line 1201, in visit 2025-05-07T20:31:54.2124504Z W0507 20:31:54.207000 275344 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/4] ret = super().visit(node) 2025-05-07T20:31:54.2125521Z W0507 20:31:54.207000 275344 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/4] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.13/ast.py", line 428, in visit 2025-05-07T20:31:54.2126652Z W0507 20:31:54.207000 275344 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/4] return 
visitor(node) 2025-05-07T20:31:54.2127854Z W0507 20:31:54.207000 275344 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/4] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/code_generator.py", line 352, in visit_Module 2025-05-07T20:31:54.2129119Z W0507 20:31:54.207000 275344 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/4] ast.NodeVisitor.generic_visit(self, node) 2025-05-07T20:31:54.2130087Z W0507 20:31:54.207000 275344 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/4] ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~^^^^^^^^^^^^ 2025-05-07T20:31:54.2131156Z W0507 20:31:54.207000 275344 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/4] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.13/ast.py", line 436, in generic_visit 2025-05-07T20:31:54.2132190Z W0507 20:31:54.207000 275344 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/4] self.visit(item) 2025-05-07T20:31:54.2132954Z W0507 20:31:54.207000 275344 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/4] ~~~~~~~~~~^^^^^^ 2025-05-07T20:31:54.2134247Z W0507 20:31:54.207000 275344 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/4] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/code_generator.py", line 1207, in visit 2025-05-07T20:31:54.2135587Z W0507 20:31:54.207000 275344 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/4] raise CompilationError(self.jit_fn.src, self.cur_node, repr(e)) from None 2025-05-07T20:31:54.2136636Z W0507 20:31:54.207000 275344 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/4] triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:31:54.2137549Z W0507 20:31:54.207000 275344 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/4] def _fbgemm_silu_mul_quant( 2025-05-07T20:31:54.2138287Z W0507 20:31:54.207000 275344 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/4] ^ 2025-05-07T20:31:54.2139294Z W0507 20:31:54.207000 275344 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/4] ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:31:54.2198599Z W0507 20:31:54.216000 275344 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/4] Encountered an exception in identify_mutated_tensors, assuming every input is mutated 2025-05-07T20:31:54.2199642Z W0507 20:31:54.216000 275344 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/4] Traceback (most recent call last): 2025-05-07T20:31:54.2200969Z W0507 20:31:54.216000 275344 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/4] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/torch/_higher_order_ops/triton_kernel_wrap.py", line 731, in identify_mutated_tensors 2025-05-07T20:31:54.2202376Z W0507 20:31:54.216000 275344 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/4] ttir_module, ordered_tensor_names = generate_ttir(kernel, kwargs) 2025-05-07T20:31:54.2203346Z W0507 20:31:54.216000 275344 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/4] ~~~~~~~~~~~~~^^^^^^^^^^^^^^^^ 2025-05-07T20:31:54.2204682Z W0507 20:31:54.216000 275344 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/4] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/torch/_higher_order_ops/triton_kernel_wrap.py", line 356, in generate_ttir 2025-05-07T20:31:54.2206042Z W0507 20:31:54.216000 275344 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/4] ttir_module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:31:54.2207466Z W0507 20:31:54.216000 275344 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/4] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py", line 100, in make_ir 2025-05-07T20:31:54.2208815Z W0507 20:31:54.216000 275344 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/4] return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:31:54.2209848Z W0507 20:31:54.216000 275344 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/4] module_map=module_map) 2025-05-07T20:31:54.2211197Z W0507 20:31:54.216000 275344 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/4] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/code_generator.py", line 1298, in ast_to_ttir 2025-05-07T20:31:54.2212421Z W0507 20:31:54.216000 275344 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/4] generator.visit(fn.parse()) 2025-05-07T20:31:54.2213271Z W0507 20:31:54.216000 275344 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/4] ~~~~~~~~~~~~~~~^^^^^^^^^^^^ 2025-05-07T20:31:54.2214577Z W0507 20:31:54.216000 275344 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/4] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/code_generator.py", line 1201, in visit 2025-05-07T20:31:54.2215770Z W0507 20:31:54.216000 275344 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/4] ret = super().visit(node) 2025-05-07T20:31:54.2216794Z W0507 20:31:54.216000 275344 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/4] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.13/ast.py", line 428, in visit 2025-05-07T20:31:54.2217803Z W0507 20:31:54.216000 275344 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/4] return 
visitor(node) 2025-05-07T20:31:54.2219008Z W0507 20:31:54.216000 275344 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/4] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/code_generator.py", line 352, in visit_Module 2025-05-07T20:31:54.2220271Z W0507 20:31:54.216000 275344 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/4] ast.NodeVisitor.generic_visit(self, node) 2025-05-07T20:31:54.2221168Z W0507 20:31:54.216000 275344 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/4] ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~^^^^^^^^^^^^ 2025-05-07T20:31:54.2222241Z W0507 20:31:54.216000 275344 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/4] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.13/ast.py", line 436, in generic_visit 2025-05-07T20:31:54.2223274Z W0507 20:31:54.216000 275344 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/4] self.visit(item) 2025-05-07T20:31:54.2224098Z W0507 20:31:54.216000 275344 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/4] ~~~~~~~~~~^^^^^^ 2025-05-07T20:31:54.2225258Z W0507 20:31:54.216000 275344 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/4] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/code_generator.py", line 1207, in visit 2025-05-07T20:31:54.2226600Z W0507 20:31:54.216000 275344 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/4] raise CompilationError(self.jit_fn.src, self.cur_node, repr(e)) from None 2025-05-07T20:31:54.2227653Z W0507 20:31:54.216000 275344 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/4] triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:31:54.2228564Z W0507 20:31:54.216000 275344 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/4] def _fbgemm_silu_mul_quant( 2025-05-07T20:31:54.2229303Z W0507 20:31:54.216000 275344 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/4] ^ 2025-05-07T20:31:54.2230398Z W0507 20:31:54.216000 275344 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/4] ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:31:54.8412816Z W0507 20:31:54.639000 275344 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/5] Encountered an exception in identify_mutated_tensors, assuming every input is mutated 2025-05-07T20:31:54.8414533Z W0507 20:31:54.639000 275344 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/5] Traceback (most recent call last): 2025-05-07T20:31:54.8417475Z W0507 20:31:54.639000 275344 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/5] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/torch/_higher_order_ops/triton_kernel_wrap.py", line 731, in identify_mutated_tensors 2025-05-07T20:31:54.8420268Z W0507 20:31:54.639000 275344 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/5] ttir_module, ordered_tensor_names = generate_ttir(kernel, kwargs) 2025-05-07T20:31:54.8422202Z W0507 20:31:54.639000 275344 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/5] ~~~~~~~~~~~~~^^^^^^^^^^^^^^^^ 2025-05-07T20:31:54.8424163Z W0507 20:31:54.639000 275344 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/5] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/torch/_higher_order_ops/triton_kernel_wrap.py", line 356, in generate_ttir 2025-05-07T20:31:54.8425519Z W0507 20:31:54.639000 275344 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/5] ttir_module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:31:54.8426805Z W0507 20:31:54.639000 275344 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/5] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py", line 100, in make_ir 2025-05-07T20:31:54.8428163Z W0507 20:31:54.639000 275344 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/5] return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:31:54.8429195Z W0507 20:31:54.639000 275344 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/5] module_map=module_map) 2025-05-07T20:31:54.8430435Z W0507 20:31:54.639000 275344 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/5] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/code_generator.py", line 1298, in ast_to_ttir 2025-05-07T20:31:54.8431658Z W0507 20:31:54.639000 275344 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/5] generator.visit(fn.parse()) 2025-05-07T20:31:54.8432494Z W0507 20:31:54.639000 275344 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/5] ~~~~~~~~~~~~~~~^^^^^^^^^^^^ 2025-05-07T20:31:54.8433689Z W0507 20:31:54.639000 275344 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/5] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/code_generator.py", line 1201, in visit 2025-05-07T20:31:54.8434881Z W0507 20:31:54.639000 275344 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/5] ret = super().visit(node) 2025-05-07T20:31:54.8435904Z W0507 20:31:54.639000 275344 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/5] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.13/ast.py", line 428, in visit 2025-05-07T20:31:54.8436914Z W0507 20:31:54.639000 275344 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/5] return 
visitor(node) 2025-05-07T20:31:54.8438112Z W0507 20:31:54.639000 275344 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/5] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/code_generator.py", line 352, in visit_Module 2025-05-07T20:31:54.8439524Z W0507 20:31:54.639000 275344 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/5] ast.NodeVisitor.generic_visit(self, node) 2025-05-07T20:31:54.8440418Z W0507 20:31:54.639000 275344 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/5] ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~^^^^^^^^^^^^ 2025-05-07T20:31:54.8441490Z W0507 20:31:54.639000 275344 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/5] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.13/ast.py", line 436, in generic_visit 2025-05-07T20:31:54.8442640Z W0507 20:31:54.639000 275344 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/5] self.visit(item) 2025-05-07T20:31:54.8443408Z W0507 20:31:54.639000 275344 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/5] ~~~~~~~~~~^^^^^^ 2025-05-07T20:31:54.8444611Z W0507 20:31:54.639000 275344 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/5] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/code_generator.py", line 1207, in visit 2025-05-07T20:31:54.8445950Z W0507 20:31:54.639000 275344 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/5] raise CompilationError(self.jit_fn.src, self.cur_node, repr(e)) from None 2025-05-07T20:31:54.8446992Z W0507 20:31:54.639000 275344 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/5] triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:31:54.8447894Z W0507 20:31:54.639000 275344 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/5] def _fbgemm_silu_mul_quant( 2025-05-07T20:31:54.8448641Z W0507 20:31:54.639000 275344 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/5] ^ 2025-05-07T20:31:54.8449649Z W0507 20:31:54.639000 275344 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/5] ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:31:54.9039548Z W0507 20:31:54.900000 275344 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/5] Encountered an exception in identify_mutated_tensors, assuming every input is mutated 2025-05-07T20:31:54.9040589Z W0507 20:31:54.900000 275344 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/5] Traceback (most recent call last): 2025-05-07T20:31:54.9041901Z W0507 20:31:54.900000 275344 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/5] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/torch/_higher_order_ops/triton_kernel_wrap.py", line 731, in identify_mutated_tensors 2025-05-07T20:31:54.9043296Z W0507 20:31:54.900000 275344 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/5] ttir_module, ordered_tensor_names = generate_ttir(kernel, kwargs) 2025-05-07T20:31:54.9044264Z W0507 20:31:54.900000 275344 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/5] ~~~~~~~~~~~~~^^^^^^^^^^^^^^^^ 2025-05-07T20:31:54.9045549Z W0507 20:31:54.900000 275344 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/5] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/torch/_higher_order_ops/triton_kernel_wrap.py", line 356, in generate_ttir 2025-05-07T20:31:54.9046905Z W0507 20:31:54.900000 275344 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/5] ttir_module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:31:54.9048183Z W0507 20:31:54.900000 275344 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/5] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py", line 100, in make_ir 2025-05-07T20:31:54.9049529Z W0507 20:31:54.900000 275344 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/5] return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:31:54.9050705Z W0507 20:31:54.900000 275344 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/5] module_map=module_map) 2025-05-07T20:31:54.9051945Z W0507 20:31:54.900000 275344 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/5] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/code_generator.py", line 1298, in ast_to_ttir 2025-05-07T20:31:54.9053169Z W0507 20:31:54.900000 275344 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/5] generator.visit(fn.parse()) 2025-05-07T20:31:54.9054275Z W0507 20:31:54.900000 275344 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/5] ~~~~~~~~~~~~~~~^^^^^^^^^^^^ 2025-05-07T20:31:54.9055462Z W0507 20:31:54.900000 275344 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/5] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/code_generator.py", line 1201, in visit 2025-05-07T20:31:54.9056662Z W0507 20:31:54.900000 275344 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/5] ret = super().visit(node) 2025-05-07T20:31:54.9057685Z W0507 20:31:54.900000 275344 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/5] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.13/ast.py", line 428, in visit 2025-05-07T20:31:54.9058689Z W0507 20:31:54.900000 275344 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/5] return 
visitor(node) 2025-05-07T20:31:54.9059892Z W0507 20:31:54.900000 275344 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/5] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/code_generator.py", line 352, in visit_Module 2025-05-07T20:31:54.9061150Z W0507 20:31:54.900000 275344 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/5] ast.NodeVisitor.generic_visit(self, node) 2025-05-07T20:31:54.9062047Z W0507 20:31:54.900000 275344 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/5] ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~^^^^^^^^^^^^ 2025-05-07T20:31:54.9063119Z W0507 20:31:54.900000 275344 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/5] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.13/ast.py", line 436, in generic_visit 2025-05-07T20:31:54.9064145Z W0507 20:31:54.900000 275344 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/5] self.visit(item) 2025-05-07T20:31:54.9064908Z W0507 20:31:54.900000 275344 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/5] ~~~~~~~~~~^^^^^^ 2025-05-07T20:31:54.9066067Z W0507 20:31:54.900000 275344 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/5] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/code_generator.py", line 1207, in visit 2025-05-07T20:31:54.9067403Z W0507 20:31:54.900000 275344 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/5] raise CompilationError(self.jit_fn.src, self.cur_node, repr(e)) from None 2025-05-07T20:31:54.9068446Z W0507 20:31:54.900000 275344 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/5] triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:31:54.9069345Z W0507 20:31:54.900000 275344 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/5] def _fbgemm_silu_mul_quant( 2025-05-07T20:31:54.9070080Z W0507 20:31:54.900000 275344 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/5] ^ 2025-05-07T20:31:54.9071091Z W0507 20:31:54.900000 275344 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/5] ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:31:55.0896995Z W0507 20:31:55.086000 275344 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/5] Encountered an exception in identify_mutated_tensors, assuming every input is mutated 2025-05-07T20:31:55.0898385Z W0507 20:31:55.086000 275344 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/5] Traceback (most recent call last): 2025-05-07T20:31:55.0899706Z W0507 20:31:55.086000 275344 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/5] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/torch/_higher_order_ops/triton_kernel_wrap.py", line 731, in identify_mutated_tensors 2025-05-07T20:31:55.0901103Z W0507 20:31:55.086000 275344 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/5] ttir_module, ordered_tensor_names = generate_ttir(kernel, kwargs) 2025-05-07T20:31:55.0902194Z W0507 20:31:55.086000 275344 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/5] ~~~~~~~~~~~~~^^^^^^^^^^^^^^^^ 2025-05-07T20:31:55.0903473Z W0507 20:31:55.086000 275344 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/5] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/torch/_higher_order_ops/triton_kernel_wrap.py", line 356, in generate_ttir 2025-05-07T20:31:55.0904838Z W0507 20:31:55.086000 275344 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/5] ttir_module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:31:55.0906131Z W0507 20:31:55.086000 275344 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/5] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py", line 100, in make_ir 2025-05-07T20:31:55.0907487Z W0507 20:31:55.086000 275344 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/5] return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:31:55.0908523Z W0507 20:31:55.086000 275344 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/5] module_map=module_map) 2025-05-07T20:31:55.0909773Z W0507 20:31:55.086000 275344 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/5] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/code_generator.py", line 1298, in ast_to_ttir 2025-05-07T20:31:55.0910994Z W0507 20:31:55.086000 275344 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/5] generator.visit(fn.parse()) 2025-05-07T20:31:55.0911837Z W0507 20:31:55.086000 275344 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/5] ~~~~~~~~~~~~~~~^^^^^^^^^^^^ 2025-05-07T20:31:55.0913037Z W0507 20:31:55.086000 275344 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/5] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/code_generator.py", line 1201, in visit 2025-05-07T20:31:55.0914233Z W0507 20:31:55.086000 275344 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/5] ret = super().visit(node) 2025-05-07T20:31:55.0915268Z W0507 20:31:55.086000 275344 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/5] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.13/ast.py", line 428, in visit 2025-05-07T20:31:55.0916276Z W0507 20:31:55.086000 275344 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/5] return 
visitor(node) 2025-05-07T20:31:55.0917481Z W0507 20:31:55.086000 275344 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/5] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/code_generator.py", line 352, in visit_Module 2025-05-07T20:31:55.0918751Z W0507 20:31:55.086000 275344 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/5] ast.NodeVisitor.generic_visit(self, node) 2025-05-07T20:31:55.0919646Z W0507 20:31:55.086000 275344 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/5] ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~^^^^^^^^^^^^ 2025-05-07T20:31:55.0920848Z W0507 20:31:55.086000 275344 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/5] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.13/ast.py", line 436, in generic_visit 2025-05-07T20:31:55.0921872Z W0507 20:31:55.086000 275344 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/5] self.visit(item) 2025-05-07T20:31:55.0922637Z W0507 20:31:55.086000 275344 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/5] ~~~~~~~~~~^^^^^^ 2025-05-07T20:31:55.0923890Z W0507 20:31:55.086000 275344 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/5] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/code_generator.py", line 1207, in visit 2025-05-07T20:31:55.0925229Z W0507 20:31:55.086000 275344 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/5] raise CompilationError(self.jit_fn.src, self.cur_node, repr(e)) from None 2025-05-07T20:31:55.0926280Z W0507 20:31:55.086000 275344 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/5] triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:31:55.0927185Z W0507 20:31:55.086000 275344 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/5] def _fbgemm_silu_mul_quant( 2025-05-07T20:31:55.0927924Z W0507 20:31:55.086000 275344 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/5] ^ 2025-05-07T20:31:55.0928930Z W0507 20:31:55.086000 275344 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/5] ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:31:55.0990630Z W0507 20:31:55.095000 275344 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/5] Encountered an exception in identify_mutated_tensors, assuming every input is mutated 2025-05-07T20:31:55.0991664Z W0507 20:31:55.095000 275344 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/5] Traceback (most recent call last): 2025-05-07T20:31:55.0992980Z W0507 20:31:55.095000 275344 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/5] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/torch/_higher_order_ops/triton_kernel_wrap.py", line 731, in identify_mutated_tensors 2025-05-07T20:31:55.0994425Z W0507 20:31:55.095000 275344 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/5] ttir_module, ordered_tensor_names = generate_ttir(kernel, kwargs) 2025-05-07T20:31:55.0995396Z W0507 20:31:55.095000 275344 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/5] ~~~~~~~~~~~~~^^^^^^^^^^^^^^^^ 2025-05-07T20:31:55.0996681Z W0507 20:31:55.095000 275344 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/5] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/torch/_higher_order_ops/triton_kernel_wrap.py", line 356, in generate_ttir 2025-05-07T20:31:55.0998047Z W0507 20:31:55.095000 275344 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/5] ttir_module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:31:55.0999540Z W0507 20:31:55.095000 275344 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/5] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py", line 100, in make_ir 2025-05-07T20:31:55.1000890Z W0507 20:31:55.095000 275344 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/5] return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:31:55.1001928Z W0507 20:31:55.095000 275344 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/5] module_map=module_map) 2025-05-07T20:31:55.1003169Z W0507 20:31:55.095000 275344 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/5] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/code_generator.py", line 1298, in ast_to_ttir 2025-05-07T20:31:55.1004537Z W0507 20:31:55.095000 275344 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/5] generator.visit(fn.parse()) 2025-05-07T20:31:55.1005372Z W0507 20:31:55.095000 275344 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/5] ~~~~~~~~~~~~~~~^^^^^^^^^^^^ 2025-05-07T20:31:55.1006555Z W0507 20:31:55.095000 275344 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/5] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/code_generator.py", line 1201, in visit 2025-05-07T20:31:55.1007852Z W0507 20:31:55.095000 275344 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/5] ret = super().visit(node) 2025-05-07T20:31:55.1008878Z W0507 20:31:55.095000 275344 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/5] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.13/ast.py", line 428, in visit 2025-05-07T20:31:55.1009889Z W0507 20:31:55.095000 275344 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/5] return 
visitor(node) 2025-05-07T20:31:55.1011090Z W0507 20:31:55.095000 275344 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/5] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/code_generator.py", line 352, in visit_Module 2025-05-07T20:31:55.1012355Z W0507 20:31:55.095000 275344 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/5] ast.NodeVisitor.generic_visit(self, node) 2025-05-07T20:31:55.1013258Z W0507 20:31:55.095000 275344 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/5] ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~^^^^^^^^^^^^ 2025-05-07T20:31:55.1014491Z W0507 20:31:55.095000 275344 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/5] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.13/ast.py", line 436, in generic_visit 2025-05-07T20:31:55.1015516Z W0507 20:31:55.095000 275344 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/5] self.visit(item) 2025-05-07T20:31:55.1016281Z W0507 20:31:55.095000 275344 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/5] ~~~~~~~~~~^^^^^^ 2025-05-07T20:31:55.1017433Z W0507 20:31:55.095000 275344 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/5] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/code_generator.py", line 1207, in visit 2025-05-07T20:31:55.1018775Z W0507 20:31:55.095000 275344 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/5] raise CompilationError(self.jit_fn.src, self.cur_node, repr(e)) from None 2025-05-07T20:31:55.1019824Z W0507 20:31:55.095000 275344 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/5] triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:31:55.1020724Z W0507 20:31:55.095000 275344 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/5] def _fbgemm_silu_mul_quant( 2025-05-07T20:31:55.1021465Z W0507 20:31:55.095000 275344 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/5] ^ 2025-05-07T20:31:55.1022474Z W0507 20:31:55.095000 275344 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/5] ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:31:55.2863331Z 2025-05-07T20:31:55.2863484Z Trying example: test_silu_mul_quant( 2025-05-07T20:31:55.2863917Z self=, 2025-05-07T20:31:55.2864372Z T=2048, 2025-05-07T20:31:55.2864587Z D=5120, 2025-05-07T20:31:55.2864790Z scale_ub=None, 2025-05-07T20:31:55.2865013Z contiguous=True, 2025-05-07T20:31:55.2865238Z compiled=True, 2025-05-07T20:31:55.2865448Z ) 2025-05-07T20:31:55.2865946Z self = 2025-05-07T20:31:55.2866433Z T = 2048, D = 5120, scale_ub = None, contiguous = True, compiled = True 2025-05-07T20:31:55.2866706Z 2025-05-07T20:31:55.2866788Z @given( 2025-05-07T20:31:55.2867029Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:31:55.2867343Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:31:55.2867652Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:31:55.2867985Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:31:55.2868310Z compiled=st.sampled_from([True, False]), 2025-05-07T20:31:55.2868602Z ) 2025-05-07T20:31:55.2869070Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:31:55.2869522Z def test_silu_mul_quant( 2025-05-07T20:31:55.2869766Z self, 2025-05-07T20:31:55.2869972Z T: int, 2025-05-07T20:31:55.2870174Z D: int, 2025-05-07T20:31:55.2870398Z scale_ub: Optional[float], 2025-05-07T20:31:55.2870676Z contiguous: bool, 2025-05-07T20:31:55.2870921Z compiled: bool, 2025-05-07T20:31:55.2871145Z ) -> None: 2025-05-07T20:31:55.2871368Z torch.manual_seed(2025) 2025-05-07T20:31:55.2871613Z 2025-05-07T20:31:55.2871884Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:31:55.2872232Z 2025-05-07T20:31:55.2872430Z x_sign = torch.sign(x) 2025-05-07T20:31:55.2872719Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:31:55.2873033Z x = x_sign * x_clamp 2025-05-07T20:31:55.2873278Z x0 = x[:, :D] 2025-05-07T20:31:55.2873495Z x1 = x[:, D:] 2025-05-07T20:31:55.2873707Z 2025-05-07T20:31:55.2873906Z if contiguous: 2025-05-07T20:31:55.2874139Z x0 = x0.contiguous() 2025-05-07T20:31:55.2874400Z x1 = x1.contiguous() 2025-05-07T20:31:55.2874645Z 2025-05-07T20:31:55.2874843Z if scale_ub is not None: 2025-05-07T20:31:55.2875118Z scale_ub_tensor = torch.tensor( 2025-05-07T20:31:55.2875456Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:31:55.2875775Z ) 2025-05-07T20:31:55.2875969Z else: 2025-05-07T20:31:55.2876187Z scale_ub_tensor = None 2025-05-07T20:31:55.2876446Z 2025-05-07T20:31:55.2876678Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:31:55.2877000Z op = silu_mul_quant 2025-05-07T20:31:55.2877256Z if compiled: 2025-05-07T20:31:55.2877501Z op = torch.compile(op) 2025-05-07T20:31:55.2877803Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:31:55.2878089Z 2025-05-07T20:31:55.2878281Z y_fp8, y_scale = fn() 2025-05-07T20:31:55.2878575Z y = y_fp8.to(torch.float32) * y_scale[:, None] 2025-05-07T20:31:55.2878875Z 2025-05-07T20:31:55.2879115Z def ref_fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:31:55.2879458Z x0_fp32 = x0.to(torch.float32) 2025-05-07T20:31:55.2879754Z x1_fp32 = x1.to(torch.float32) 2025-05-07T20:31:55.2880075Z y = x0_fp32 * torch.sigmoid(x0_fp32) * x1_fp32 2025-05-07T20:31:55.2880429Z return triton_quantize_fp8_row(y, scale_ub_tensor) 2025-05-07T20:31:55.2880745Z 2025-05-07T20:31:55.2880952Z > y_fp8_ref, y_scale_ref = ref_fn() 2025-05-07T20:31:55.2881145Z 2025-05-07T20:31:55.2881249Z moe/activation_test.py:126: 2025-05-07T20:31:55.2881547Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:31:55.2881886Z moe/activation_test.py:124: in ref_fn 2025-05-07T20:31:55.2882216Z return triton_quantize_fp8_row(y, scale_ub_tensor) 2025-05-07T20:31:55.2882999Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/fbgemm_gpu/experimental/gemm/triton_gemm/fp8_gemm.py:2370: in triton_quantize_fp8_row 2025-05-07T20:31:55.2883749Z _kernel_quantize_fp8_row[grid]( 2025-05-07T20:31:55.2884388Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:31:55.2885064Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:31:55.2885759Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/autotuner.py:186: in run 2025-05-07T20:31:55.2886482Z timings = {config: self._bench(*args, config=config, **kwargs) for config in pruned_configs} 2025-05-07T20:31:55.2887202Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/autotuner.py:166: in _bench 2025-05-07T20:31:55.2887833Z return self.do_bench(kernel_call, quantiles=(0.5, 0.2, 0.8)) 2025-05-07T20:31:55.2888504Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/testing.py:117: in do_bench 2025-05-07T20:31:55.2889024Z fn() 2025-05-07T20:31:55.2889528Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/autotuner.py:152: in kernel_call 2025-05-07T20:31:55.2890109Z self.fn.run( 2025-05-07T20:31:55.2890579Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:31:55.2891111Z kernel = self.compile( 2025-05-07T20:31:55.2891646Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:31:55.2892296Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:31:55.2892695Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:31:55.2892926Z 2025-05-07T20:31:55.2893140Z self = 2025-05-07T20:31:55.2894378Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=2, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:31:55.2895735Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f93a651ab60>} 2025-05-07T20:31:55.2897062Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:31:55.2904415Z context = 2025-05-07T20:31:55.2904742Z 2025-05-07T20:31:55.2904914Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:31:55.2905437Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:31:55.2905894Z module_map=module_map) 2025-05-07T20:31:55.2906256Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:31:55.2906608Z E def _kernel_quantize_fp8_row( 2025-05-07T20:31:55.2906875Z E ^ 2025-05-07T20:31:55.2907330Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:31:55.2907778Z 2025-05-07T20:31:55.2908193Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:31:55.2908698Z 2025-05-07T20:31:55.2908813Z Trying example: test_silu_mul_quant( 2025-05-07T20:31:55.2909218Z self=, 2025-05-07T20:31:55.2909616Z T=128, 2025-05-07T20:31:55.2909833Z D=5120, 2025-05-07T20:31:55.2910028Z scale_ub=None, 2025-05-07T20:31:55.2910239Z contiguous=True, 2025-05-07T20:31:55.2910466Z compiled=True, 2025-05-07T20:31:55.2910671Z ) 2025-05-07T20:31:55.2910983Z self = 2025-05-07T20:31:55.2911463Z T = 128, D = 5120, scale_ub = None, contiguous = True, compiled = True 2025-05-07T20:31:55.2911893Z 2025-05-07T20:31:55.2911974Z @given( 2025-05-07T20:31:55.2912208Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:31:55.2912513Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:31:55.2912811Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:31:55.2913139Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:31:55.2913456Z compiled=st.sampled_from([True, False]), 2025-05-07T20:31:55.2913742Z ) 2025-05-07T20:31:55.2914090Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:31:55.2914526Z def test_silu_mul_quant( 2025-05-07T20:31:55.2914766Z self, 2025-05-07T20:31:55.2914960Z T: int, 2025-05-07T20:31:55.2915274Z D: int, 2025-05-07T20:31:55.2915490Z scale_ub: Optional[float], 2025-05-07T20:31:55.2915772Z contiguous: bool, 2025-05-07T20:31:55.2916015Z compiled: bool, 2025-05-07T20:31:55.2916242Z ) -> None: 2025-05-07T20:31:55.2916458Z torch.manual_seed(2025) 2025-05-07T20:31:55.2916704Z 2025-05-07T20:31:55.2916968Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:31:55.2917303Z 2025-05-07T20:31:55.2917495Z x_sign = torch.sign(x) 2025-05-07T20:31:55.2917776Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:31:55.2918083Z x = x_sign * x_clamp 2025-05-07T20:31:55.2918320Z x0 = x[:, :D] 2025-05-07T20:31:55.2918533Z x1 = x[:, D:] 2025-05-07T20:31:55.2918735Z 2025-05-07T20:31:55.2918920Z if contiguous: 2025-05-07T20:31:55.2919149Z x0 = x0.contiguous() 2025-05-07T20:31:55.2919406Z x1 = x1.contiguous() 2025-05-07T20:31:55.2919639Z 2025-05-07T20:31:55.2919825Z if scale_ub is not None: 2025-05-07T20:31:55.2920099Z scale_ub_tensor = torch.tensor( 2025-05-07T20:31:55.2920422Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:31:55.2920728Z ) 2025-05-07T20:31:55.2920919Z else: 2025-05-07T20:31:55.2921122Z scale_ub_tensor = None 2025-05-07T20:31:55.2921370Z 2025-05-07T20:31:55.2921597Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:31:55.2921903Z op = silu_mul_quant 2025-05-07T20:31:55.2922144Z if compiled: 2025-05-07T20:31:55.2922390Z op = torch.compile(op) 2025-05-07T20:31:55.2922679Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:31:55.2922954Z 2025-05-07T20:31:55.2923144Z y_fp8, y_scale = fn() 2025-05-07T20:31:55.2923436Z y = y_fp8.to(torch.float32) * y_scale[:, None] 2025-05-07T20:31:55.2923746Z 2025-05-07T20:31:55.2924008Z def ref_fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:31:55.2924337Z x0_fp32 = x0.to(torch.float32) 2025-05-07T20:31:55.2924618Z x1_fp32 = x1.to(torch.float32) 2025-05-07T20:31:55.2924925Z y = x0_fp32 * torch.sigmoid(x0_fp32) * x1_fp32 2025-05-07T20:31:55.2925276Z return triton_quantize_fp8_row(y, scale_ub_tensor) 2025-05-07T20:31:55.2925578Z 2025-05-07T20:31:55.2925774Z > y_fp8_ref, 
y_scale_ref = ref_fn() 2025-05-07T20:31:55.2925963Z 2025-05-07T20:31:55.2926067Z moe/activation_test.py:126: 2025-05-07T20:31:55.2926354Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:31:55.2926678Z moe/activation_test.py:124: in ref_fn 2025-05-07T20:31:55.2926995Z return triton_quantize_fp8_row(y, scale_ub_tensor) 2025-05-07T20:31:55.2927765Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/fbgemm_gpu/experimental/gemm/triton_gemm/fp8_gemm.py:2370: in triton_quantize_fp8_row 2025-05-07T20:31:55.2928502Z _kernel_quantize_fp8_row[grid]( 2025-05-07T20:31:55.2929039Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:31:55.2929707Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:31:55.2930471Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/autotuner.py:186: in run 2025-05-07T20:31:55.2931177Z timings = {config: self._bench(*args, config=config, **kwargs) for config in pruned_configs} 2025-05-07T20:31:55.2931894Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/autotuner.py:166: in _bench 2025-05-07T20:31:55.2932522Z return self.do_bench(kernel_call, quantiles=(0.5, 0.2, 0.8)) 2025-05-07T20:31:55.2933115Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/testing.py:117: in do_bench 2025-05-07T20:31:55.2933674Z fn() 2025-05-07T20:31:55.2934305Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/autotuner.py:152: in kernel_call 2025-05-07T20:31:55.2934879Z self.fn.run( 2025-05-07T20:31:55.2935332Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:31:55.2935862Z kernel = self.compile( 2025-05-07T20:31:55.2936389Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:31:55.2937031Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:31:55.2937420Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:31:55.2937651Z 2025-05-07T20:31:55.2937859Z self = 2025-05-07T20:31:55.2938932Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=2, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:31:55.2940273Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f93a6c42700>} 2025-05-07T20:31:55.2941592Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:31:55.2942595Z context = 2025-05-07T20:31:55.2942882Z 2025-05-07T20:31:55.2943044Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:31:55.2943554Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:31:55.2944008Z module_map=module_map) 2025-05-07T20:31:55.2944377Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:31:55.2944732Z E def _kernel_quantize_fp8_row( 2025-05-07T20:31:55.2944993Z E ^ 2025-05-07T20:31:55.2945449Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:31:55.2945894Z 2025-05-07T20:31:55.2946301Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:31:55.5234148Z W0507 20:31:55.519000 275344 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/6] Encountered an exception in identify_mutated_tensors, assuming every input is mutated 2025-05-07T20:31:55.5235295Z W0507 20:31:55.519000 275344 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/6] Traceback (most recent call last): 2025-05-07T20:31:55.5236632Z W0507 20:31:55.519000 275344 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/6] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/torch/_higher_order_ops/triton_kernel_wrap.py", line 731, in identify_mutated_tensors 2025-05-07T20:31:55.5238028Z W0507 20:31:55.519000 275344 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/6] ttir_module, ordered_tensor_names = generate_ttir(kernel, kwargs) 2025-05-07T20:31:55.5239174Z W0507 20:31:55.519000 275344 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/6] ~~~~~~~~~~~~~^^^^^^^^^^^^^^^^ 2025-05-07T20:31:55.5240464Z W0507 20:31:55.519000 275344 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/6] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/torch/_higher_order_ops/triton_kernel_wrap.py", line 356, in generate_ttir 2025-05-07T20:31:55.5241822Z W0507 20:31:55.519000 275344 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/6] ttir_module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:31:55.5243220Z W0507 20:31:55.519000 275344 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/6] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py", line 100, in make_ir 2025-05-07T20:31:55.5244582Z W0507 20:31:55.519000 275344 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/6] return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:31:55.5245620Z W0507 20:31:55.519000 275344 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/6] module_map=module_map) 2025-05-07T20:31:55.5246869Z W0507 20:31:55.519000 275344 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/6] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/code_generator.py", line 1298, in ast_to_ttir 2025-05-07T20:31:55.5248105Z W0507 20:31:55.519000 275344 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/6] generator.visit(fn.parse()) 2025-05-07T20:31:55.5248948Z W0507 20:31:55.519000 275344 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/6] ~~~~~~~~~~~~~~~^^^^^^^^^^^^ 2025-05-07T20:31:55.5250141Z W0507 20:31:55.519000 275344 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/6] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/code_generator.py", line 1201, in visit 2025-05-07T20:31:55.5251344Z W0507 20:31:55.519000 275344 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/6] ret = super().visit(node) 2025-05-07T20:31:55.5252372Z W0507 20:31:55.519000 275344 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/6] File 
"/home/ec2-user/miniconda/envs/build_binary/lib/python3.13/ast.py", line 428, in visit 2025-05-07T20:31:55.5253380Z W0507 20:31:55.519000 275344 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/6] return visitor(node) 2025-05-07T20:31:55.5254722Z W0507 20:31:55.519000 275344 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/6] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/code_generator.py", line 352, in visit_Module 2025-05-07T20:31:55.5255990Z W0507 20:31:55.519000 275344 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/6] ast.NodeVisitor.generic_visit(self, node) 2025-05-07T20:31:55.5256889Z W0507 20:31:55.519000 275344 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/6] ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~^^^^^^^^^^^^ 2025-05-07T20:31:55.5257971Z W0507 20:31:55.519000 275344 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/6] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.13/ast.py", line 436, in generic_visit 2025-05-07T20:31:55.5259003Z W0507 20:31:55.519000 275344 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/6] self.visit(item) 2025-05-07T20:31:55.5259765Z W0507 20:31:55.519000 275344 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/6] ~~~~~~~~~~^^^^^^ 2025-05-07T20:31:55.5260920Z W0507 20:31:55.519000 275344 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/6] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/code_generator.py", line 1207, in visit 2025-05-07T20:31:55.5262689Z W0507 20:31:55.519000 275344 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/6] raise CompilationError(self.jit_fn.src, self.cur_node, repr(e)) from None 2025-05-07T20:31:55.5263993Z W0507 20:31:55.519000 275344 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/6] triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:31:55.5265111Z W0507 20:31:55.519000 275344 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/6] def _fbgemm_silu_mul_quant( 2025-05-07T20:31:55.5266102Z W0507 20:31:55.519000 275344 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/6] ^ 2025-05-07T20:31:55.5267126Z W0507 20:31:55.519000 275344 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/6] ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')")
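Every failure above has the same root cause: the Triton kernel requests the fp8e4nv (FP8 E4M3) dtype, which Triton only provides on NVIDIA GPUs of compute capability 8.9 or newer, while the A10G in this g5.4xlarge runner is sm_86 and only exposes fp8e4b15 and fp8e5. Below is a minimal sketch of a capability guard that would skip these examples on older architectures; the helper name running_on_sm89_or_newer and the test class name are hypothetical, not part of the test suite above.

import unittest

import torch


def running_on_sm89_or_newer() -> bool:
    # fp8e4nv needs sm_89+ (Ada/Hopper); an A10G (sm_86) reports (8, 6) here.
    if not torch.cuda.is_available():
        return False
    major, minor = torch.cuda.get_device_capability()
    return (major, minor) >= (8, 9)


# Example usage: skip fp8 E4M3 tests wholesale on unsupported hardware.
@unittest.skipUnless(running_on_sm89_or_newer(), "fp8e4nv needs sm_89+")
class TestFp8Kernels(unittest.TestCase):
    ...

Gating at the class level would also keep Hypothesis from re-triggering the same Triton compilation failure for every drawn example on hardware that can never pass.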
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:31:56.2391437Z W0507 20:31:56.235000 275344 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/7] Encountered an exception in identify_mutated_tensors, assuming every input is mutated 2025-05-07T20:31:56.2393392Z W0507 20:31:56.235000 275344 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/7] Traceback (most recent call last): 2025-05-07T20:31:56.2395213Z W0507 20:31:56.235000 275344 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/7] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/torch/_higher_order_ops/triton_kernel_wrap.py", line 731, in identify_mutated_tensors 2025-05-07T20:31:56.2396616Z W0507 20:31:56.235000 275344 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/7] ttir_module, ordered_tensor_names = generate_ttir(kernel, kwargs) 2025-05-07T20:31:56.2397582Z W0507 20:31:56.235000 275344 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/7] ~~~~~~~~~~~~~^^^^^^^^^^^^^^^^ 2025-05-07T20:31:56.2399037Z W0507 20:31:56.235000 275344 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/7] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/torch/_higher_order_ops/triton_kernel_wrap.py", line 356, in generate_ttir 2025-05-07T20:31:56.2400390Z W0507 20:31:56.235000 275344 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/7] ttir_module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:31:56.2401847Z W0507 20:31:56.235000 275344 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/7] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py", line 100, in make_ir 2025-05-07T20:31:56.2403197Z W0507 20:31:56.235000 275344 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/7] return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:31:56.2404229Z W0507 20:31:56.235000 275344 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/7] module_map=module_map) 2025-05-07T20:31:56.2405577Z W0507 20:31:56.235000 275344 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/7] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/code_generator.py", line 1298, in ast_to_ttir 2025-05-07T20:31:56.2406798Z W0507 20:31:56.235000 275344 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/7] generator.visit(fn.parse()) 2025-05-07T20:31:56.2407650Z W0507 20:31:56.235000 275344 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/7] ~~~~~~~~~~~~~~~^^^^^^^^^^^^ 2025-05-07T20:31:56.2408835Z W0507 20:31:56.235000 275344 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/7] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/code_generator.py", line 1201, in visit 2025-05-07T20:31:56.2410026Z W0507 20:31:56.235000 275344 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/7] ret = super().visit(node) 2025-05-07T20:31:56.2411055Z W0507 20:31:56.235000 275344 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/7] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.13/ast.py", line 428, in visit 2025-05-07T20:31:56.2412057Z W0507 20:31:56.235000 275344 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/7] return 
visitor(node) 2025-05-07T20:31:56.2413264Z W0507 20:31:56.235000 275344 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/7] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/code_generator.py", line 352, in visit_Module 2025-05-07T20:31:56.2414642Z W0507 20:31:56.235000 275344 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/7] ast.NodeVisitor.generic_visit(self, node) 2025-05-07T20:31:56.2415540Z W0507 20:31:56.235000 275344 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/7] ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~^^^^^^^^^^^^ 2025-05-07T20:31:56.2416612Z W0507 20:31:56.235000 275344 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/7] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.13/ast.py", line 436, in generic_visit 2025-05-07T20:31:56.2417637Z W0507 20:31:56.235000 275344 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/7] self.visit(item) 2025-05-07T20:31:56.2418400Z W0507 20:31:56.235000 275344 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/7] ~~~~~~~~~~^^^^^^ 2025-05-07T20:31:56.2419559Z W0507 20:31:56.235000 275344 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/7] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/code_generator.py", line 1207, in visit 2025-05-07T20:31:56.2420894Z W0507 20:31:56.235000 275344 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/7] raise CompilationError(self.jit_fn.src, self.cur_node, repr(e)) from None 2025-05-07T20:31:56.2421944Z W0507 20:31:56.235000 275344 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/7] triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:31:56.2422852Z W0507 20:31:56.235000 275344 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/7] def _fbgemm_silu_mul_quant( 2025-05-07T20:31:56.2423592Z W0507 20:31:56.235000 275344 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/7] ^ 2025-05-07T20:31:56.2424793Z W0507 20:31:56.235000 275344 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/7] ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')")
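For reference, the row-wise quantization that ref_fn reaches through triton_quantize_fp8_row can be approximated in pure PyTorch, sidestepping Triton's architecture check. This is a rough sketch written to match the dequantization used in the test (y = y_fp8.to(torch.float32) * y_scale[:, None]); the exact semantics of triton_quantize_fp8_row, including how scale_ub is applied, are assumptions here, and quantize_fp8_row_ref is a hypothetical name.

from typing import Optional, Tuple

import torch

# Largest finite value representable in float8_e4m3fn.
FP8_E4M3_MAX = 448.0


def quantize_fp8_row_ref(
    x: torch.Tensor, scale_ub: Optional[torch.Tensor] = None
) -> Tuple[torch.Tensor, torch.Tensor]:
    # Per-row absolute maximum, optionally bounded by scale_ub.
    row_max = x.abs().amax(dim=-1).float()
    if scale_ub is not None:
        row_max = torch.clamp(row_max, max=scale_ub.item())
    # Guard all-zero rows against division by zero.
    scale = torch.clamp(row_max, min=1e-12) / FP8_E4M3_MAX
    # Quantize so that dequantization is q.float() * scale[:, None].
    x_fp8 = (x.float() / scale[:, None]).to(torch.float8_e4m3fn)
    return x_fp8, scale

The cast to torch.float8_e4m3fn is an ordinary dtype conversion in PyTorch, so this should run on sm_86 hardware where the Triton fp8e4nv path fails.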
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')")
2025-05-07T20:31:56.7220849Z 
2025-05-07T20:31:56.7221209Z Trying example: test_silu_mul_quant(
2025-05-07T20:31:56.7221680Z     self=,
2025-05-07T20:31:56.7222104Z     T=4096,
2025-05-07T20:31:56.7222305Z     D=5120,
2025-05-07T20:31:56.7230149Z     scale_ub=None,
2025-05-07T20:31:56.7230388Z     contiguous=True,
2025-05-07T20:31:56.7230600Z     compiled=True,
2025-05-07T20:31:56.7230795Z )
2025-05-07T20:31:56.7231144Z self = 
2025-05-07T20:31:56.7231630Z T = 4096, D = 5120, scale_ub = None, contiguous = True, compiled = True
2025-05-07T20:31:56.7231891Z 
2025-05-07T20:31:56.7231967Z     @given(
2025-05-07T20:31:56.7232185Z         T=st.sampled_from([1, 128, 2048, 4096, 16384]),
2025-05-07T20:31:56.7232492Z         D=st.sampled_from([5120, 7168]),
2025-05-07T20:31:56.7232791Z         scale_ub=st.sampled_from([None, 1200.00]),
2025-05-07T20:31:56.7233105Z         contiguous=st.sampled_from([True, False]),
2025-05-07T20:31:56.7233428Z         compiled=st.sampled_from([True, False]),
2025-05-07T20:31:56.7233707Z     )
2025-05-07T20:31:56.7234044Z     @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None)
2025-05-07T20:31:56.7234476Z     def test_silu_mul_quant(
2025-05-07T20:31:56.7234710Z         self,
2025-05-07T20:31:56.7234897Z         T: int,
2025-05-07T20:31:56.7235075Z         D: int,
2025-05-07T20:31:56.7235283Z         scale_ub: Optional[float],
2025-05-07T20:31:56.7235552Z         contiguous: bool,
2025-05-07T20:31:56.7235777Z         compiled: bool,
2025-05-07T20:31:56.7235992Z     ) -> None:
2025-05-07T20:31:56.7236202Z         torch.manual_seed(2025)
2025-05-07T20:31:56.7236428Z 
2025-05-07T20:31:56.7236691Z         x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16)
2025-05-07T20:31:56.7237026Z 
2025-05-07T20:31:56.7237208Z         x_sign = torch.sign(x)
2025-05-07T20:31:56.7237489Z         x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0)
2025-05-07T20:31:56.7237789Z         x = x_sign * x_clamp
2025-05-07T20:31:56.7238031Z         x0 = x[:, :D]
2025-05-07T20:31:56.7238235Z         x1 = x[:, D:]
2025-05-07T20:31:56.7238431Z 
2025-05-07T20:31:56.7238606Z         if contiguous:
2025-05-07T20:31:56.7238824Z             x0 = x0.contiguous()
2025-05-07T20:31:56.7239263Z             x1 = x1.contiguous()
2025-05-07T20:31:56.7239495Z 
2025-05-07T20:31:56.7239677Z         if scale_ub is not None:
2025-05-07T20:31:56.7239943Z             scale_ub_tensor = torch.tensor(
2025-05-07T20:31:56.7240267Z                 [scale_ub], device="cuda", dtype=torch.float32
2025-05-07T20:31:56.7240563Z             )
2025-05-07T20:31:56.7240746Z         else:
2025-05-07T20:31:56.7240947Z             scale_ub_tensor = None
2025-05-07T20:31:56.7241184Z 
2025-05-07T20:31:56.7241409Z         def fn() -> Tuple[torch.Tensor, torch.Tensor]:
2025-05-07T20:31:56.7241713Z             op = silu_mul_quant
2025-05-07T20:31:56.7241952Z             if compiled:
2025-05-07T20:31:56.7242194Z                 op = torch.compile(op)
2025-05-07T20:31:56.7242611Z             return op(x0, x1, scale_ub_tensor)
2025-05-07T20:31:56.7242881Z 
2025-05-07T20:31:56.7243065Z         y_fp8, y_scale = fn()
2025-05-07T20:31:56.7243333Z         y = y_fp8.to(torch.float32) * y_scale[:, None]
2025-05-07T20:31:56.7243624Z 
2025-05-07T20:31:56.7243852Z         def ref_fn() -> Tuple[torch.Tensor, torch.Tensor]:
2025-05-07T20:31:56.7244171Z             x0_fp32 = x0.to(torch.float32)
2025-05-07T20:31:56.7244451Z             x1_fp32 = x1.to(torch.float32)
2025-05-07T20:31:56.7244754Z             y = x0_fp32 * torch.sigmoid(x0_fp32) * x1_fp32
2025-05-07T20:31:56.7245103Z             return triton_quantize_fp8_row(y, scale_ub_tensor)
2025-05-07T20:31:56.7245400Z 
2025-05-07T20:31:56.7245592Z >       y_fp8_ref, y_scale_ref = ref_fn()
2025-05-07T20:31:56.7245778Z 
2025-05-07T20:31:56.7245878Z moe/activation_test.py:126: 
2025-05-07T20:31:56.7246157Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _
2025-05-07T20:31:56.7246491Z moe/activation_test.py:124: in ref_fn
2025-05-07T20:31:56.7246801Z     return triton_quantize_fp8_row(y, scale_ub_tensor)
2025-05-07T20:31:56.7247567Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/fbgemm_gpu/experimental/gemm/triton_gemm/fp8_gemm.py:2370: in triton_quantize_fp8_row
2025-05-07T20:31:56.7248305Z     _kernel_quantize_fp8_row[grid](
2025-05-07T20:31:56.7248834Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/jit.py:330: in <lambda>
2025-05-07T20:31:56.7249494Z     return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs)
2025-05-07T20:31:56.7250160Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/autotuner.py:186: in run
2025-05-07T20:31:56.7250861Z     timings = {config: self._bench(*args, config=config, **kwargs) for config in pruned_configs}
2025-05-07T20:31:56.7251567Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/autotuner.py:166: in _bench
2025-05-07T20:31:56.7252190Z     return self.do_bench(kernel_call, quantiles=(0.5, 0.2, 0.8))
2025-05-07T20:31:56.7252772Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/testing.py:117: in do_bench
2025-05-07T20:31:56.7253280Z     fn()
2025-05-07T20:31:56.7253885Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/autotuner.py:152: in kernel_call
2025-05-07T20:31:56.7254442Z     self.fn.run(
2025-05-07T20:31:56.7254950Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/jit.py:623: in run
2025-05-07T20:31:56.7255467Z     kernel = self.compile(
2025-05-07T20:31:56.7255991Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:273: in compile
2025-05-07T20:31:56.7256621Z     module = src.make_ir(options, codegen_fns, module_map, context)
2025-05-07T20:31:56.7257007Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _
2025-05-07T20:31:56.7257235Z 
2025-05-07T20:31:56.7257444Z self = 
2025-05-07T20:31:56.7258500Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=2, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True)
2025-05-07T20:31:56.7259922Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f93a55a3600>}
2025-05-07T20:31:56.7261231Z module_map = {'triton.language.extra.libdevice': }
2025-05-07T20:31:56.7262221Z context = 
2025-05-07T20:31:56.7262499Z 
2025-05-07T20:31:56.7262736Z     def make_ir(self, options, codegen_fns, module_map, context):
2025-05-07T20:31:56.7263237Z >       return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns,
2025-05-07T20:31:56.7263691Z                            module_map=module_map)
2025-05-07T20:31:56.7264050Z E       triton.compiler.errors.CompilationError: at 1:0:
2025-05-07T20:31:56.7264398Z E       def _kernel_quantize_fp8_row(
2025-05-07T20:31:56.7264668Z E       ^
2025-05-07T20:31:56.7265152Z E       ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')")
2025-05-07T20:31:56.7265584Z 
2025-05-07T20:31:56.7265993Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:100: CompilationError
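
Every failure in this stretch of the log is the same Triton error: the kernels request the fp8e4nv (e4m3) dtype, and the GPU in this run only exposes 'fp8e4b15' and 'fp8e5'. Triton generally gates fp8e4nv on NVIDIA compute capability 8.9 (Ada/Hopper) and newer, so the device here is most likely older than that. Below is a minimal sketch of a capability guard that would skip these tests instead of failing them; the helper name supports_fp8e4nv is hypothetical, while torch.cuda.get_device_capability and unittest.skipUnless are standard APIs.

    import unittest
    import torch

    def supports_fp8e4nv() -> bool:
        # fp8e4nv (e4m3) conversions in Triton generally require sm_89 or newer.
        if not torch.cuda.is_available():
            return False
        return torch.cuda.get_device_capability() >= (8, 9)

    # Hypothetical usage on the test above:
    # @unittest.skipUnless(supports_fp8e4nv(), "fp8e4nv unsupported on this GPU")
    # def test_silu_mul_quant(self, ...) -> None: ...
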
2025-05-07T20:31:56.7266492Z 
2025-05-07T20:31:56.7266591Z Trying example: test_silu_mul_quant(self=, T=16384, D=5120, scale_ub=None, contiguous=True, compiled=True)
2025-05-07T20:31:56.7282903Z >       y_fp8_ref, y_scale_ref = ref_fn()
2025-05-07T20:31:56.7283187Z moe/activation_test.py:126:
2025-05-07T20:31:56.7284923Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/fbgemm_gpu/experimental/gemm/triton_gemm/fp8_gemm.py:2370: in triton_quantize_fp8_row
2025-05-07T20:31:56.7285646Z     _kernel_quantize_fp8_row[grid](
2025-05-07T20:31:56.7301700Z E       triton.compiler.errors.CompilationError: at 1:0:
2025-05-07T20:31:56.7302044Z E       def _kernel_quantize_fp8_row(
2025-05-07T20:31:56.7302301Z E       ^
2025-05-07T20:31:56.7302748Z E       ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')")
2025-05-07T20:31:56.7303182Z 
2025-05-07T20:31:56.7303585Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:100: CompilationError
2025-05-07T20:31:56.7518641Z W0507 20:31:56.750000 275344 site-packages/torch/_dynamo/convert_frame.py:987] [1/8] torch._dynamo hit config.recompile_limit (8)
2025-05-07T20:31:56.7519861Z W0507 20:31:56.750000 275344 site-packages/torch/_dynamo/convert_frame.py:987] [1/8] function: 'silu_mul_quant' (/home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:55)
2025-05-07T20:31:56.7521159Z W0507 20:31:56.750000 275344 site-packages/torch/_dynamo/convert_frame.py:987] [1/8] last reason: 1/7: tensor 'x0' stride mismatch at index 0. expected 5120, actual 10240
2025-05-07T20:31:56.7522134Z W0507 20:31:56.750000 275344 site-packages/torch/_dynamo/convert_frame.py:987] [1/8] To log all recompilation reasons, use TORCH_LOGS="recompiles".
2025-05-07T20:31:56.7523211Z W0507 20:31:56.750000 275344 site-packages/torch/_dynamo/convert_frame.py:987] [1/8] To diagnose recompilation issues, see https://pytorch.org/docs/main/torch.compiler_troubleshooting.html.
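
The recompile-limit warning above is independent of the fp8 failures: torch.compile guards on tensor strides, and the test alternates between contiguous copies of x0 (row stride D = 5120) and non-contiguous views into the [T, 2 * D] buffer (row stride 2 * D = 10240), so each new layout recompiles silu_mul_quant until the limit of 8 is hit. A small sketch of the stride difference the guard reacts to, with toy sizes in place of the test's D = 5120:

    import torch

    D = 10
    x = torch.randn(4, 2 * D, dtype=torch.bfloat16)
    x0 = x[:, :D]
    print(x0.stride())               # (20, 1): view into the [T, 2*D] buffer, the "actual 10240" case
    print(x0.contiguous().stride())  # (10, 1): standalone [T, D] copy, the "expected 5120" case

As the warning itself suggests, TORCH_LOGS="recompiles" prints every recompilation reason, and torch._dynamo.config.recompile_limit can be raised when this churn is expected.
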
2025-05-07T20:31:56.9663260Z Trying example: test_silu_mul_quant(self=, T=1, D=5120, scale_ub=1200.0, contiguous=True, compiled=True)
2025-05-07T20:31:56.9677519Z >       y_fp8, y_scale = fn()
2025-05-07T20:31:56.9677777Z moe/activation_test.py:117:
2025-05-07T20:31:56.9680420Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant
2025-05-07T20:31:56.9681087Z     _fbgemm_silu_mul_quant[grid](
2025-05-07T20:31:56.9692089Z E       triton.compiler.errors.CompilationError: at 1:0:
2025-05-07T20:31:56.9692437Z E       def _fbgemm_silu_mul_quant(
2025-05-07T20:31:56.9692693Z E       ^
2025-05-07T20:31:56.9693142Z E       ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')")
2025-05-07T20:31:56.9694186Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:100: CompilationError
2025-05-07T20:31:56.9694846Z Trying example: test_silu_mul_quant(self=, T=1, D=5120, scale_ub=None, contiguous=False, compiled=True)
2025-05-07T20:31:56.9711748Z >       y_fp8_ref, y_scale_ref = ref_fn()
2025-05-07T20:31:56.9712155Z moe/activation_test.py:126:
2025-05-07T20:31:56.9713881Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/fbgemm_gpu/experimental/gemm/triton_gemm/fp8_gemm.py:2370: in triton_quantize_fp8_row
2025-05-07T20:31:56.9714666Z     _kernel_quantize_fp8_row[grid](
2025-05-07T20:31:56.9736287Z E       triton.compiler.errors.CompilationError: at 1:0:
2025-05-07T20:31:56.9736643Z E       def _kernel_quantize_fp8_row(
2025-05-07T20:31:56.9736909Z E       ^
2025-05-07T20:31:56.9737362Z E       ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')")
2025-05-07T20:31:56.9738212Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:100: CompilationError
2025-05-07T20:31:57.1132953Z Trying example: test_silu_mul_quant(self=, T=1, D=5120, scale_ub=None, contiguous=True, compiled=False)
2025-05-07T20:31:57.1147251Z >       y_fp8, y_scale = fn()
2025-05-07T20:31:57.1147685Z moe/activation_test.py:117:
2025-05-07T20:31:57.1149275Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant
2025-05-07T20:31:57.1149956Z     _fbgemm_silu_mul_quant[grid](
2025-05-07T20:31:57.1161006Z E       triton.compiler.errors.CompilationError: at 1:0:
2025-05-07T20:31:57.1161353Z E       def _fbgemm_silu_mul_quant(
2025-05-07T20:31:57.1161608Z E       ^
2025-05-07T20:31:57.1162071Z E       ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')")
2025-05-07T20:31:57.1162925Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:100: CompilationError
2025-05-07T20:31:57.1163535Z Trying example: test_silu_mul_quant(self=, T=128, D=5120, scale_ub=None, contiguous=False, compiled=True)
2025-05-07T20:31:57.1177874Z >       y_fp8, y_scale = fn()
2025-05-07T20:31:57.1178135Z moe/activation_test.py:117:
2025-05-07T20:31:57.1180799Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant
2025-05-07T20:31:57.1181475Z     _fbgemm_silu_mul_quant[grid](
2025-05-07T20:31:57.1192594Z E       triton.compiler.errors.CompilationError: at 1:0:
2025-05-07T20:31:57.1192941Z E       def _fbgemm_silu_mul_quant(
2025-05-07T20:31:57.1193198Z E       ^
2025-05-07T20:31:57.1193646Z E       ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')")
2025-05-07T20:31:57.1194501Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:100: CompilationError
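
For orientation between the repeated failures: the quantity under test is small. ref_fn computes silu(x0) * x1, i.e. x0 * sigmoid(x0) * x1, in fp32 and then row-wise FP8 quantization through triton_quantize_fp8_row. A rough pure-PyTorch restatement of that quantization step, assuming row-max scaling into the e4m3 range (448.0) and treating scale_ub as an upper bound on the per-row max; this illustrates the intent and is not FBGEMM's implementation:

    import torch

    FP8_E4M3_MAX = 448.0  # largest finite value in torch.float8_e4m3fn

    def rowwise_quantize_fp8_ref(y, scale_ub=None):
        row_max = y.abs().amax(dim=1).to(torch.float32)
        if scale_ub is not None:
            row_max = torch.clamp(row_max, max=float(scale_ub))
        scale = torch.clamp(row_max, min=1e-12) / FP8_E4M3_MAX  # per-row dequant scale
        y_fp8 = (y / scale[:, None]).to(torch.float8_e4m3fn)
        return y_fp8, scale

    x0 = torch.randn(8, 16)
    x1 = torch.randn(8, 16)
    y = x0 * torch.sigmoid(x0) * x1                      # silu(x0) * x1, as in ref_fn
    y_fp8, y_scale = rowwise_quantize_fp8_ref(y)
    y_back = y_fp8.to(torch.float32) * y_scale[:, None]  # dequantize as the test does

Dequantizing with y_fp8.to(torch.float32) * y_scale[:, None], exactly as the test body does, recovers y up to fp8 rounding error.
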
_fbgemm_silu_mul_quant[grid]( 2025-05-07T20:31:57.2782834Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:31:57.2783497Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:31:57.2784155Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:31:57.2784674Z kernel = self.compile( 2025-05-07T20:31:57.2785203Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:31:57.2785852Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:31:57.2786247Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:31:57.2786467Z 2025-05-07T20:31:57.2786670Z self = 2025-05-07T20:31:57.2787719Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:31:57.2789066Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f93a5c30220>} 2025-05-07T20:31:57.2790380Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:31:57.2791375Z context = 2025-05-07T20:31:57.2791657Z 2025-05-07T20:31:57.2791818Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:31:57.2792327Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:31:57.2792790Z module_map=module_map) 2025-05-07T20:31:57.2793142Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:31:57.2793486Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:31:57.2793741Z E ^ 2025-05-07T20:31:57.2794186Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:31:57.2794683Z 2025-05-07T20:31:57.2795089Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:31:57.2795594Z 2025-05-07T20:31:57.2795787Z Trying example: test_silu_mul_quant( 2025-05-07T20:31:57.2796191Z self=, 2025-05-07T20:31:57.2796573Z T=128, 2025-05-07T20:31:57.2796753Z D=5120, 2025-05-07T20:31:57.2796945Z scale_ub=None, 2025-05-07T20:31:57.2797148Z contiguous=False, 2025-05-07T20:31:57.2797365Z compiled=False, 2025-05-07T20:31:57.2797562Z ) 2025-05-07T20:31:57.2797865Z self = 2025-05-07T20:31:57.2798526Z T = 128, D = 5120, scale_ub = None, contiguous = False, compiled = False 2025-05-07T20:31:57.2798795Z 2025-05-07T20:31:57.2798869Z @given( 2025-05-07T20:31:57.2799099Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:31:57.2799527Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:31:57.2799828Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:31:57.2800151Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:31:57.2800470Z compiled=st.sampled_from([True, False]), 2025-05-07T20:31:57.2800750Z ) 2025-05-07T20:31:57.2801092Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:31:57.2801521Z def test_silu_mul_quant( 2025-05-07T20:31:57.2801755Z self, 2025-05-07T20:31:57.2801944Z T: int, 2025-05-07T20:31:57.2802136Z D: int, 2025-05-07T20:31:57.2802344Z scale_ub: Optional[float], 2025-05-07T20:31:57.2802608Z contiguous: bool, 2025-05-07T20:31:57.2802838Z compiled: bool, 2025-05-07T20:31:57.2803049Z ) -> None: 2025-05-07T20:31:57.2803261Z torch.manual_seed(2025) 2025-05-07T20:31:57.2803492Z 2025-05-07T20:31:57.2803755Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:31:57.2804087Z 2025-05-07T20:31:57.2804275Z x_sign = torch.sign(x) 2025-05-07T20:31:57.2804555Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:31:57.2804859Z x = x_sign * x_clamp 2025-05-07T20:31:57.2805098Z x0 = x[:, :D] 2025-05-07T20:31:57.2805298Z x1 = x[:, D:] 2025-05-07T20:31:57.2805503Z 2025-05-07T20:31:57.2805679Z if contiguous: 2025-05-07T20:31:57.2805894Z x0 = x0.contiguous() 2025-05-07T20:31:57.2806143Z x1 = x1.contiguous() 2025-05-07T20:31:57.2806380Z 2025-05-07T20:31:57.2806560Z if scale_ub is not None: 2025-05-07T20:31:57.2806831Z scale_ub_tensor = torch.tensor( 2025-05-07T20:31:57.2807157Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:31:57.2807453Z ) 2025-05-07T20:31:57.2807635Z else: 2025-05-07T20:31:57.2807837Z scale_ub_tensor = None 2025-05-07T20:31:57.2808083Z 2025-05-07T20:31:57.2808306Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:31:57.2808613Z op = silu_mul_quant 2025-05-07T20:31:57.2808862Z if compiled: 2025-05-07T20:31:57.2809096Z op = torch.compile(op) 2025-05-07T20:31:57.2809392Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:31:57.2809661Z 2025-05-07T20:31:57.2809842Z > y_fp8, y_scale = fn() 2025-05-07T20:31:57.2810004Z 2025-05-07T20:31:57.2810098Z moe/activation_test.py:117: 2025-05-07T20:31:57.2810383Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:31:57.2810707Z moe/activation_test.py:115: in fn 2025-05-07T20:31:57.2810980Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:31:57.2811655Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:31:57.2812328Z 
_fbgemm_silu_mul_quant[grid]( 2025-05-07T20:31:57.2812856Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:31:57.2813523Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:31:57.2814445Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:31:57.2814986Z kernel = self.compile( 2025-05-07T20:31:57.2815511Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:31:57.2816156Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:31:57.2816545Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:31:57.2816767Z 2025-05-07T20:31:57.2816967Z self = 2025-05-07T20:31:57.2818090Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:31:57.2819427Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f93a4a4c900>} 2025-05-07T20:31:57.2820746Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:31:57.2821738Z context = 2025-05-07T20:31:57.2822017Z 2025-05-07T20:31:57.2822178Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:31:57.2822685Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:31:57.2823142Z module_map=module_map) 2025-05-07T20:31:57.2823500Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:31:57.2823842Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:31:57.2824097Z E ^ 2025-05-07T20:31:57.2824592Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:31:57.2825035Z 2025-05-07T20:31:57.2825439Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:31:57.2825944Z 2025-05-07T20:31:57.2826045Z Trying example: test_silu_mul_quant( 2025-05-07T20:31:57.2826453Z self=, 2025-05-07T20:31:57.2826847Z T=128, 2025-05-07T20:31:57.2827027Z D=5120, 2025-05-07T20:31:57.2827216Z scale_ub=1200.0, 2025-05-07T20:31:57.2827434Z contiguous=True, 2025-05-07T20:31:57.2827644Z compiled=False, 2025-05-07T20:31:57.2827841Z ) 2025-05-07T20:31:57.2828154Z self = 2025-05-07T20:31:57.2828626Z T = 128, D = 5120, scale_ub = 1200.0, contiguous = True, compiled = False 2025-05-07T20:31:57.2828897Z 2025-05-07T20:31:57.2828972Z @given( 2025-05-07T20:31:57.2829199Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:31:57.2829496Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:31:57.2829796Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:31:57.2830119Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:31:57.2830439Z compiled=st.sampled_from([True, False]), 2025-05-07T20:31:57.2830714Z ) 2025-05-07T20:31:57.2831052Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:31:57.2831483Z def test_silu_mul_quant( 2025-05-07T20:31:57.2831707Z self, 2025-05-07T20:31:57.2831895Z T: int, 2025-05-07T20:31:57.2832083Z D: int, 2025-05-07T20:31:57.2832294Z scale_ub: Optional[float], 2025-05-07T20:31:57.2832563Z contiguous: bool, 2025-05-07T20:31:57.2832795Z compiled: bool, 2025-05-07T20:31:57.2833007Z ) -> None: 2025-05-07T20:31:57.2833218Z torch.manual_seed(2025) 2025-05-07T20:31:57.2833621Z 2025-05-07T20:31:57.2833881Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:31:57.2834217Z 2025-05-07T20:31:57.2834409Z x_sign = torch.sign(x) 2025-05-07T20:31:57.2834689Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:31:57.2834990Z x = x_sign * x_clamp 2025-05-07T20:31:57.2835226Z x0 = x[:, :D] 2025-05-07T20:31:57.2835434Z x1 = x[:, D:] 2025-05-07T20:31:57.2835631Z 2025-05-07T20:31:57.2835814Z if contiguous: 2025-05-07T20:31:57.2836039Z x0 = x0.contiguous() 2025-05-07T20:31:57.2836285Z x1 = x1.contiguous() 2025-05-07T20:31:57.2836521Z 2025-05-07T20:31:57.2836708Z if scale_ub is not None: 2025-05-07T20:31:57.2837050Z scale_ub_tensor = torch.tensor( 2025-05-07T20:31:57.2837386Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:31:57.2843582Z ) 2025-05-07T20:31:57.2843797Z else: 2025-05-07T20:31:57.2844018Z scale_ub_tensor = None 2025-05-07T20:31:57.2844271Z 2025-05-07T20:31:57.2844505Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:31:57.2844823Z op = silu_mul_quant 2025-05-07T20:31:57.2845068Z if compiled: 2025-05-07T20:31:57.2845312Z op = torch.compile(op) 2025-05-07T20:31:57.2845612Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:31:57.2845881Z 2025-05-07T20:31:57.2846082Z > y_fp8, y_scale = fn() 2025-05-07T20:31:57.2846248Z 2025-05-07T20:31:57.2846358Z moe/activation_test.py:117: 2025-05-07T20:31:57.2846649Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:31:57.2846991Z moe/activation_test.py:115: in fn 2025-05-07T20:31:57.2847270Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:31:57.2847953Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:31:57.2848633Z 
_fbgemm_silu_mul_quant[grid]( 2025-05-07T20:31:57.2849166Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:31:57.2849837Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:31:57.2850489Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:31:57.2851016Z kernel = self.compile( 2025-05-07T20:31:57.2851548Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:31:57.2852193Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:31:57.2852591Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:31:57.2852819Z 2025-05-07T20:31:57.2853023Z self = 2025-05-07T20:31:57.2854211Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:31:57.2855585Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f93a4c39ee0>} 2025-05-07T20:31:57.2856898Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:31:57.2857900Z context = 2025-05-07T20:31:57.2858196Z 2025-05-07T20:31:57.2858359Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:31:57.2858869Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:31:57.2859451Z module_map=module_map) 2025-05-07T20:31:57.2859810Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:31:57.2860163Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:31:57.2860419Z E ^ 2025-05-07T20:31:57.2860870Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:31:57.2861312Z 2025-05-07T20:31:57.2861725Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:31:57.4388747Z 2025-05-07T20:31:57.4389081Z Trying example: test_silu_mul_quant( 2025-05-07T20:31:57.4389722Z self=, 2025-05-07T20:31:57.4390131Z T=1, 2025-05-07T20:31:57.4390323Z D=7168, 2025-05-07T20:31:57.4390519Z scale_ub=1200.0, 2025-05-07T20:31:57.4390737Z contiguous=True, 2025-05-07T20:31:57.4390964Z compiled=True, 2025-05-07T20:31:57.4391168Z ) 2025-05-07T20:31:57.4391481Z self = 2025-05-07T20:31:57.4391964Z T = 1, D = 7168, scale_ub = 1200.0, contiguous = True, compiled = True 2025-05-07T20:31:57.4392221Z 2025-05-07T20:31:57.4392309Z @given( 2025-05-07T20:31:57.4392538Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:31:57.4392844Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:31:57.4393143Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:31:57.4393469Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:31:57.4393784Z compiled=st.sampled_from([True, False]), 2025-05-07T20:31:57.4394066Z ) 2025-05-07T20:31:57.4394420Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:31:57.4394851Z def test_silu_mul_quant( 2025-05-07T20:31:57.4395085Z self, 2025-05-07T20:31:57.4395276Z T: int, 2025-05-07T20:31:57.4395472Z D: int, 2025-05-07T20:31:57.4395688Z scale_ub: Optional[float], 2025-05-07T20:31:57.4395954Z contiguous: bool, 2025-05-07T20:31:57.4396184Z compiled: bool, 2025-05-07T20:31:57.4396403Z ) -> None: 2025-05-07T20:31:57.4396617Z torch.manual_seed(2025) 2025-05-07T20:31:57.4396852Z 2025-05-07T20:31:57.4397149Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:31:57.4397479Z 2025-05-07T20:31:57.4397670Z x_sign = torch.sign(x) 2025-05-07T20:31:57.4397955Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:31:57.4398424Z x = x_sign * x_clamp 2025-05-07T20:31:57.4398665Z x0 = x[:, :D] 2025-05-07T20:31:57.4398885Z x1 = x[:, D:] 2025-05-07T20:31:57.4399088Z 2025-05-07T20:31:57.4399268Z if contiguous: 2025-05-07T20:31:57.4399493Z x0 = x0.contiguous() 2025-05-07T20:31:57.4399746Z x1 = x1.contiguous() 2025-05-07T20:31:57.4399985Z 2025-05-07T20:31:57.4400175Z if scale_ub is not None: 2025-05-07T20:31:57.4400442Z scale_ub_tensor = torch.tensor( 2025-05-07T20:31:57.4400765Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:31:57.4401067Z ) 2025-05-07T20:31:57.4401257Z else: 2025-05-07T20:31:57.4401463Z scale_ub_tensor = None 2025-05-07T20:31:57.4401715Z 2025-05-07T20:31:57.4401947Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:31:57.4402251Z op = silu_mul_quant 2025-05-07T20:31:57.4402497Z if compiled: 2025-05-07T20:31:57.4402742Z op = torch.compile(op) 2025-05-07T20:31:57.4403032Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:31:57.4403313Z 2025-05-07T20:31:57.4403509Z > y_fp8, y_scale = fn() 2025-05-07T20:31:57.4403670Z 2025-05-07T20:31:57.4403773Z moe/activation_test.py:117: 2025-05-07T20:31:57.4404060Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:31:57.4404529Z moe/activation_test.py:115: in fn 2025-05-07T20:31:57.4404811Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:31:57.4405362Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/torch/_dynamo/eval_frame.py:678: in _fn 2025-05-07T20:31:57.4405916Z return fn(*args, **kwargs) 
2025-05-07T20:31:57.4406558Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:31:57.4407230Z _fbgemm_silu_mul_quant[grid]( 2025-05-07T20:31:57.4407756Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:31:57.4408527Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:31:57.4409180Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:31:57.4409701Z kernel = self.compile( 2025-05-07T20:31:57.4410229Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:31:57.4410875Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:31:57.4411270Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:31:57.4411492Z 2025-05-07T20:31:57.4411694Z self = 2025-05-07T20:31:57.4412761Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:31:57.4414200Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f93a4c3a660>} 2025-05-07T20:31:57.4415517Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:31:57.4416513Z context = 2025-05-07T20:31:57.4416793Z 2025-05-07T20:31:57.4416958Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:31:57.4417466Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:31:57.4417923Z module_map=module_map) 2025-05-07T20:31:57.4418276Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:31:57.4418633Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:31:57.4418888Z E ^ 2025-05-07T20:31:57.4419335Z E ValueError("type fp8e4nv not supported in this architecture. 
2025-05-07T20:31:57.4420189Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:100: CompilationError

2025-05-07T20:31:57.4420795Z Trying example: test_silu_mul_quant(self=, T=1, D=7168, scale_ub=1200.0, contiguous=False, compiled=True)
[test source identical to the example above; fails at `y_fp8, y_scale = fn()` (moe/activation_test.py:117) with the identical traceback through torch/_dynamo/eval_frame.py:678 into `_fbgemm_silu_mul_quant` (fbgemm_gpu/experimental/gen_ai/moe/activation.py:80) and the identical CompilationError raised from triton/compiler/compiler.py:100: ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')")]
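Why every example fails: Triton's fp8e4nv type corresponds to float8_e4m3fn, and this job ran on a g5.4xlarge runner (NVIDIA A10G, compute capability 8.6). As the ValueError itself states, this Triton build only offers fp8e4b15 and fp8e5 on this architecture; fp8e4nv kernels compile only on SM 8.9+ GPUs (L4/L40S, H100, and newer). A minimal sketch of a capability guard that would skip these examples on pre-8.9 devices; the helper name and skip message are illustrative, not taken from the test file:

    import pytest
    import torch

    def require_fp8e4nv() -> None:
        # fp8e4nv (float8_e4m3fn) only lowers on SM 8.9+; the A10G on this
        # runner reports SM 8.6, so Triton rejects the kernel at compile time.
        major, minor = torch.cuda.get_device_capability()
        if (major, minor) < (8, 9):
            pytest.skip(f"fp8e4nv requires SM >= 8.9, got SM {major}.{minor}")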
2025-05-07T20:31:57.4451793Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:100: CompilationError

2025-05-07T20:31:57.8772752Z Trying example: test_silu_mul_quant(self=, T=1, D=7168, scale_ub=None, contiguous=False, compiled=True)
[test source identical to the example above] With scale_ub=None this example gets past fn(); the failure moves into the fp32 reference path:

        y_fp8, y_scale = fn()
        y = y_fp8.to(torch.float32) * y_scale[:, None]

        def ref_fn() -> Tuple[torch.Tensor, torch.Tensor]:
            x0_fp32 = x0.to(torch.float32)
            x1_fp32 = x1.to(torch.float32)
            y = x0_fp32 * torch.sigmoid(x0_fp32) * x1_fp32
            return triton_quantize_fp8_row(y, scale_ub_tensor)

>       y_fp8_ref, y_scale_ref = ref_fn()

moe/activation_test.py:126:
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _
moe/activation_test.py:124: in ref_fn
    return triton_quantize_fp8_row(y, scale_ub_tensor)
/home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/fbgemm_gpu/experimental/gemm/triton_gemm/fp8_gemm.py:2370: in triton_quantize_fp8_row
    _kernel_quantize_fp8_row[grid](
/home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/jit.py:330: in <lambda>
    return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs)
/home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/autotuner.py:186: in run
    timings = {config: self._bench(*args, config=config, **kwargs) for config in pruned_configs}
/home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/autotuner.py:166: in _bench
    return self.do_bench(kernel_call, quantiles=(0.5, 0.2, 0.8))
/home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/testing.py:117: in do_bench
    fn()
/home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/autotuner.py:152: in kernel_call
    self.fn.run(
/home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/jit.py:623: in run
    kernel = self.compile(
/home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:273: in compile
    module = src.make_ir(options, codegen_fns, module_map, context)
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _

    def make_ir(self, options, codegen_fns, module_map, context):
>       return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns,
                           module_map=module_map)
E       triton.compiler.errors.CompilationError: at 1:0:
E       def _kernel_quantize_fp8_row(
E       ^
E       ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')")

2025-05-07T20:31:57.8815247Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:100: CompilationError
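This example exposes the same root cause from a second path: the reference implementation also launches a Triton kernel, `_kernel_quantize_fp8_row` via triton_quantize_fp8_row, which likewise targets fp8e4nv, so even the eager reference cannot compile on this GPU. For orientation, a rough eager sketch of what a row-wise fp8 quantizer computes; the 448.0 maximum (float8_e4m3fn) and the exact clamping details are assumptions, not taken from fp8_gemm.py:

    import torch

    FP8_MAX = 448.0  # max finite value of torch.float8_e4m3fn (assumed target dtype)

    def quantize_fp8_row_ref(y, scale_ub=None):
        # One scale per row, so each row's max |value| maps onto the fp8 range;
        # scale_ub, when given, caps the row max like the test's scale_ub_tensor.
        row_max = y.abs().amax(dim=1).clamp(min=1e-12)
        if scale_ub is not None:
            row_max = torch.minimum(row_max, scale_ub)
        y_scale = row_max / FP8_MAX
        y_fp8 = (y / y_scale[:, None]).clamp(-FP8_MAX, FP8_MAX).to(torch.float8_e4m3fn)
        return y_fp8, y_scale

Dequantizing with y_fp8.to(torch.float32) * y_scale[:, None] then recovers y up to fp8 rounding, which is exactly how the test reconstructs y from fn()'s outputs.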
2025-05-07T20:31:57.8815851Z Trying example: test_silu_mul_quant(self=, T=1, D=5120, scale_ub=1200.0, contiguous=False, compiled=True)
2025-05-07T20:31:58.0239939Z Trying example: test_silu_mul_quant(self=, T=1, D=5120, scale_ub=1200.0, contiguous=False, compiled=False)
2025-05-07T20:31:58.0270877Z Trying example: test_silu_mul_quant(self=, T=16384, D=5120, scale_ub=1200.0, contiguous=False, compiled=True)
2025-05-07T20:31:58.0302525Z Trying example: test_silu_mul_quant(self=, T=2048, D=7168, scale_ub=1200.0, contiguous=False, compiled=True)
2025-05-07T20:31:58.2177357Z Trying example: test_silu_mul_quant(self=, T=1, D=5120, scale_ub=None, contiguous=False, compiled=False)
2025-05-07T20:31:58.2208202Z Trying example: test_silu_mul_quant(self=, T=4096, D=7168, scale_ub=1200.0, contiguous=False, compiled=False)
2025-05-07T20:31:58.3786098Z Trying example: test_silu_mul_quant(self=, T=16384, D=7168, scale_ub=None, contiguous=True, compiled=True)
2025-05-07T20:31:58.3819209Z Trying example: test_silu_mul_quant(self=, T=4096, D=5120, scale_ub=None, contiguous=False, compiled=True)
[each of these eight examples prints the identical test source and fails at `y_fp8, y_scale = fn()` (moe/activation_test.py:117) inside `_fbgemm_silu_mul_quant` (fbgemm_gpu/experimental/gen_ai/moe/activation.py:80), the compiled=True runs additionally passing through torch/_dynamo/eval_frame.py:678, with the identical CompilationError raised from triton/compiler/compiler.py:100: ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')")]
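Note that the compiled=True and compiled=False examples above fail identically: torch.compile (the torch/_dynamo/eval_frame.py frame in the earlier tracebacks) only wraps the Python call, while the Triton kernel is still JIT-compiled at its first launch, which is where the ValueError surfaces on this GPU. A sketch of that call shape, mirroring the test's fn(); the names here are illustrative:

    import torch

    def run_op(op, *args, compiled: bool = False):
        # torch.compile does not pre-build the underlying Triton kernel;
        # _fbgemm_silu_mul_quant still compiles lazily on the first call,
        # so both the compiled and eager paths raise the same
        # CompilationError on an SM 8.6 device.
        if compiled:
            op = torch.compile(op)
        return op(*args)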
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')")
2025-05-07T20:31:58.3849210Z 
2025-05-07T20:31:58.3849614Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:100: CompilationError
2025-05-07T20:31:58.5219770Z 
2025-05-07T20:31:58.5220095Z Trying example: test_silu_mul_quant(T=4096, D=5120, scale_ub=1200.0, contiguous=False, compiled=False) -> same CompilationError (test body and traceback identical to the first failure above)
2025-05-07T20:31:58.5251038Z Trying example: test_silu_mul_quant(T=4096, D=5120, scale_ub=1200.0, contiguous=False, compiled=True) -> same CompilationError
2025-05-07T20:31:58.5288795Z Trying example: test_silu_mul_quant(T=2048, D=7168, scale_ub=1200.0, contiguous=False, compiled=False) -> same CompilationError
2025-05-07T20:31:58.7293209Z Trying example: test_silu_mul_quant(T=1, D=7168, scale_ub=None, contiguous=True, compiled=False) -> same CompilationError
2025-05-07T20:31:58.7344426Z Trying example: test_silu_mul_quant(T=16384, D=7168, scale_ub=1200.0, contiguous=False, compiled=True) -> same CompilationError
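For orientation, the op under test fuses SiLU, an elementwise multiply, and fp8 quantization. Below is a minimal eager-mode sketch of the semantics the test appears to exercise; silu_mul_quant_ref, the amax-based per-tensor scaling, and the clamp epsilon are illustrative assumptions inferred from the test body, not FBGEMM's actual implementation:

    import torch
    import torch.nn.functional as F

    def silu_mul_quant_ref(
        x0: torch.Tensor,
        x1: torch.Tensor,
        scale_ub: torch.Tensor | None = None,
    ) -> tuple[torch.Tensor, torch.Tensor]:
        # Compute SiLU(x0) * x1 in fp32, then quantize to fp8 E4M3
        # (Triton's fp8e4nv corresponds to torch.float8_e4m3fn).
        y = F.silu(x0.float()) * x1.float()
        amax = y.abs().amax().clamp(min=1e-12)  # avoid a zero scale
        if scale_ub is not None:
            amax = torch.minimum(amax, scale_ub.float())  # cap the scale
        scale = amax / torch.finfo(torch.float8_e4m3fn).max  # E4M3 max = 448
        y_fp8 = (y / scale).to(torch.float8_e4m3fn)
        return y_fp8, scale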
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:31:58.7395053Z 2025-05-07T20:31:58.7395783Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:31:58.8690994Z 2025-05-07T20:31:58.8692186Z Trying example: test_silu_mul_quant( 2025-05-07T20:31:58.8693486Z self=, 2025-05-07T20:31:58.8694634Z T=1, 2025-05-07T20:31:58.8694840Z D=7168, 2025-05-07T20:31:58.8695039Z scale_ub=None, 2025-05-07T20:31:58.8695266Z contiguous=False, 2025-05-07T20:31:58.8695505Z compiled=False, 2025-05-07T20:31:58.8695717Z ) 2025-05-07T20:31:58.8696073Z self = 2025-05-07T20:31:58.8696570Z T = 1, D = 7168, scale_ub = None, contiguous = False, compiled = False 2025-05-07T20:31:58.8696834Z 2025-05-07T20:31:58.8696918Z @given( 2025-05-07T20:31:58.8697168Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:31:58.8697487Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:31:58.8697797Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:31:58.8698123Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:31:58.8698704Z compiled=st.sampled_from([True, False]), 2025-05-07T20:31:58.8698997Z ) 2025-05-07T20:31:58.8699345Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:31:58.8699794Z def test_silu_mul_quant( 2025-05-07T20:31:58.8700050Z self, 2025-05-07T20:31:58.8700252Z T: int, 2025-05-07T20:31:58.8700459Z D: int, 2025-05-07T20:31:58.8700692Z scale_ub: Optional[float], 2025-05-07T20:31:58.8700963Z contiguous: bool, 2025-05-07T20:31:58.8701209Z compiled: bool, 2025-05-07T20:31:58.8701446Z ) -> None: 2025-05-07T20:31:58.8701663Z torch.manual_seed(2025) 2025-05-07T20:31:58.8701918Z 2025-05-07T20:31:58.8702196Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:31:58.8702548Z 2025-05-07T20:31:58.8702745Z x_sign = torch.sign(x) 2025-05-07T20:31:58.8703045Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:31:58.8703366Z x = x_sign * x_clamp 2025-05-07T20:31:58.8703609Z x0 = x[:, :D] 2025-05-07T20:31:58.8703837Z x1 = x[:, D:] 2025-05-07T20:31:58.8704059Z 2025-05-07T20:31:58.8704249Z if contiguous: 2025-05-07T20:31:58.8704491Z x0 = x0.contiguous() 2025-05-07T20:31:58.8704754Z x1 = x1.contiguous() 2025-05-07T20:31:58.8704989Z 2025-05-07T20:31:58.8705190Z if scale_ub is not None: 2025-05-07T20:31:58.8705471Z scale_ub_tensor = torch.tensor( 2025-05-07T20:31:58.8705802Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:31:58.8706116Z ) 2025-05-07T20:31:58.8706322Z else: 2025-05-07T20:31:58.8706900Z scale_ub_tensor = None 2025-05-07T20:31:58.8707158Z 2025-05-07T20:31:58.8707398Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:31:58.8707712Z op = silu_mul_quant 2025-05-07T20:31:58.8707972Z if compiled: 2025-05-07T20:31:58.8708230Z op = torch.compile(op) 2025-05-07T20:31:58.8708540Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:31:58.8708819Z 2025-05-07T20:31:58.8709025Z > y_fp8, y_scale = fn() 2025-05-07T20:31:58.8709190Z 2025-05-07T20:31:58.8709301Z moe/activation_test.py:117: 2025-05-07T20:31:58.8709597Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:31:58.8710088Z moe/activation_test.py:115: in fn 2025-05-07T20:31:58.8710380Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:31:58.8711069Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:31:58.8711763Z 
_fbgemm_silu_mul_quant[grid]( 2025-05-07T20:31:58.8712301Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:31:58.8712982Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:31:58.8713633Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:31:58.8714168Z kernel = self.compile( 2025-05-07T20:31:58.8714709Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:31:58.8715368Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:31:58.8715767Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:31:58.8716001Z 2025-05-07T20:31:58.8716208Z self = 2025-05-07T20:31:58.8717285Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:31:58.8721236Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f9397a149a0>} 2025-05-07T20:31:58.8722558Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:31:58.8723567Z context = 2025-05-07T20:31:58.8723858Z 2025-05-07T20:31:58.8724024Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:31:58.8724544Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:31:58.8725013Z module_map=module_map) 2025-05-07T20:31:58.8725372Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:31:58.8725728Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:31:58.8725990Z E ^ 2025-05-07T20:31:58.8726445Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:31:58.8726895Z 2025-05-07T20:31:58.8727303Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:31:58.8727815Z 2025-05-07T20:31:58.8727920Z Trying example: test_silu_mul_quant( 2025-05-07T20:31:58.8728344Z self=, 2025-05-07T20:31:58.8728739Z T=2048, 2025-05-07T20:31:58.8728937Z D=7168, 2025-05-07T20:31:58.8729137Z scale_ub=None, 2025-05-07T20:31:58.8729349Z contiguous=False, 2025-05-07T20:31:58.8729691Z compiled=True, 2025-05-07T20:31:58.8729900Z ) 2025-05-07T20:31:58.8730216Z self = 2025-05-07T20:31:58.8730709Z T = 2048, D = 7168, scale_ub = None, contiguous = False, compiled = True 2025-05-07T20:31:58.8730977Z 2025-05-07T20:31:58.8731070Z @given( 2025-05-07T20:31:58.8731299Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:31:58.8731617Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:31:58.8731928Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:31:58.8732260Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:31:58.8732580Z compiled=st.sampled_from([True, False]), 2025-05-07T20:31:58.8732870Z ) 2025-05-07T20:31:58.8733308Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:31:58.8733839Z def test_silu_mul_quant( 2025-05-07T20:31:58.8734107Z self, 2025-05-07T20:31:58.8734300Z T: int, 2025-05-07T20:31:58.8734506Z D: int, 2025-05-07T20:31:58.8734727Z scale_ub: Optional[float], 2025-05-07T20:31:58.8744311Z contiguous: bool, 2025-05-07T20:31:58.8744728Z compiled: bool, 2025-05-07T20:31:58.8745000Z ) -> None: 2025-05-07T20:31:58.8745224Z torch.manual_seed(2025) 2025-05-07T20:31:58.8745485Z 2025-05-07T20:31:58.8745775Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:31:58.8746134Z 2025-05-07T20:31:58.8746334Z x_sign = torch.sign(x) 2025-05-07T20:31:58.8746638Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:31:58.8746960Z x = x_sign * x_clamp 2025-05-07T20:31:58.8747203Z x0 = x[:, :D] 2025-05-07T20:31:58.8747441Z x1 = x[:, D:] 2025-05-07T20:31:58.8747660Z 2025-05-07T20:31:58.8747850Z if contiguous: 2025-05-07T20:31:58.8748093Z x0 = x0.contiguous() 2025-05-07T20:31:58.8748364Z x1 = x1.contiguous() 2025-05-07T20:31:58.8748615Z 2025-05-07T20:31:58.8748820Z if scale_ub is not None: 2025-05-07T20:31:58.8749104Z scale_ub_tensor = torch.tensor( 2025-05-07T20:31:58.8749449Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:31:58.8749778Z ) 2025-05-07T20:31:58.8749988Z else: 2025-05-07T20:31:58.8750207Z scale_ub_tensor = None 2025-05-07T20:31:58.8750480Z 2025-05-07T20:31:58.8750739Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:31:58.8751068Z op = silu_mul_quant 2025-05-07T20:31:58.8751322Z if compiled: 2025-05-07T20:31:58.8751581Z op = torch.compile(op) 2025-05-07T20:31:58.8751892Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:31:58.8752177Z 2025-05-07T20:31:58.8752385Z > y_fp8, y_scale = fn() 2025-05-07T20:31:58.8752554Z 2025-05-07T20:31:58.8752664Z moe/activation_test.py:117: 2025-05-07T20:31:58.8752957Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:31:58.8753302Z moe/activation_test.py:115: in fn 2025-05-07T20:31:58.8753585Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:31:58.8754140Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/torch/_dynamo/eval_frame.py:678: in _fn 2025-05-07T20:31:58.8754704Z return fn(*args, **kwargs) 
2025-05-07T20:31:58.8755363Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:31:58.8756047Z _fbgemm_silu_mul_quant[grid]( 2025-05-07T20:31:58.8756578Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:31:58.8757258Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:31:58.8757918Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:31:58.8758629Z kernel = self.compile( 2025-05-07T20:31:58.8759166Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:31:58.8759822Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:31:58.8760227Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:31:58.8760463Z 2025-05-07T20:31:58.8760667Z self = 2025-05-07T20:31:58.8761861Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:31:58.8763219Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f9397a16160>} 2025-05-07T20:31:58.8764550Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:31:58.8765558Z context = 2025-05-07T20:31:58.8765847Z 2025-05-07T20:31:58.8766015Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:31:58.8766530Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:31:58.8766990Z module_map=module_map) 2025-05-07T20:31:58.8767355Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:31:58.8767720Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:31:58.8767977Z E ^ 2025-05-07T20:31:58.8768436Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:31:58.8768890Z 2025-05-07T20:31:58.8769300Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:31:58.8769807Z 2025-05-07T20:31:58.8769918Z Trying example: test_silu_mul_quant( 2025-05-07T20:31:58.8770324Z self=, 2025-05-07T20:31:58.8770729Z T=4096, 2025-05-07T20:31:58.8770921Z D=7168, 2025-05-07T20:31:58.8771115Z scale_ub=None, 2025-05-07T20:31:58.8771332Z contiguous=False, 2025-05-07T20:31:58.8771555Z compiled=True, 2025-05-07T20:31:59.1010739Z ) 2025-05-07T20:31:59.1011331Z self = 2025-05-07T20:31:59.1012073Z T = 4096, D = 7168, scale_ub = None, contiguous = False, compiled = True 2025-05-07T20:31:59.1012454Z 2025-05-07T20:31:59.1012565Z @given( 2025-05-07T20:31:59.1012854Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:31:59.1013204Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:31:59.1013525Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:31:59.1013966Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:31:59.1014299Z compiled=st.sampled_from([True, False]), 2025-05-07T20:31:59.1014615Z ) 2025-05-07T20:31:59.1015017Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:31:59.1015535Z def test_silu_mul_quant( 2025-05-07T20:31:59.1015776Z self, 2025-05-07T20:31:59.1015978Z T: int, 2025-05-07T20:31:59.1016179Z D: int, 2025-05-07T20:31:59.1016398Z scale_ub: Optional[float], 2025-05-07T20:31:59.1016681Z contiguous: bool, 2025-05-07T20:31:59.1016926Z compiled: bool, 2025-05-07T20:31:59.1017157Z ) -> None: 2025-05-07T20:31:59.1017378Z torch.manual_seed(2025) 2025-05-07T20:31:59.1017623Z 2025-05-07T20:31:59.1017891Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:31:59.1018610Z 2025-05-07T20:31:59.1018806Z x_sign = torch.sign(x) 2025-05-07T20:31:59.1019097Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:31:59.1019409Z x = x_sign * x_clamp 2025-05-07T20:31:59.1019651Z x0 = x[:, :D] 2025-05-07T20:31:59.1019875Z x1 = x[:, D:] 2025-05-07T20:31:59.1020082Z 2025-05-07T20:31:59.1020272Z if contiguous: 2025-05-07T20:31:59.1020506Z x0 = x0.contiguous() 2025-05-07T20:31:59.1020762Z x1 = x1.contiguous() 2025-05-07T20:31:59.1021013Z 2025-05-07T20:31:59.1021213Z if scale_ub is not None: 2025-05-07T20:31:59.1021483Z scale_ub_tensor = torch.tensor( 2025-05-07T20:31:59.1021977Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:31:59.1022292Z ) 2025-05-07T20:31:59.1022485Z else: 2025-05-07T20:31:59.1022704Z scale_ub_tensor = None 2025-05-07T20:31:59.1022960Z 2025-05-07T20:31:59.1023186Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:31:59.1023512Z op = silu_mul_quant 2025-05-07T20:31:59.1023768Z if compiled: 2025-05-07T20:31:59.1024013Z op = torch.compile(op) 2025-05-07T20:31:59.1024317Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:31:59.1024599Z 2025-05-07T20:31:59.1024822Z > y_fp8, y_scale = fn() 2025-05-07T20:31:59.1025009Z 2025-05-07T20:31:59.1025118Z moe/activation_test.py:117: 2025-05-07T20:31:59.1025420Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:31:59.1025756Z moe/activation_test.py:115: in fn 2025-05-07T20:31:59.1026036Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:31:59.1026606Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/torch/_dynamo/eval_frame.py:678: in _fn 2025-05-07T20:31:59.1027172Z return fn(*args, **kwargs) 
2025-05-07T20:31:59.1027820Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:31:59.1028510Z _fbgemm_silu_mul_quant[grid]( 2025-05-07T20:31:59.1029048Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:31:59.1029725Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:31:59.1030377Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:31:59.1030921Z kernel = self.compile( 2025-05-07T20:31:59.1031464Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:31:59.1032122Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:31:59.1032515Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:31:59.1032745Z 2025-05-07T20:31:59.1032952Z self = 2025-05-07T20:31:59.1034022Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:31:59.1035390Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f9397a16e80>} 2025-05-07T20:31:59.1036705Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:31:59.1037720Z context = 2025-05-07T20:31:59.1038008Z 2025-05-07T20:31:59.1038173Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:31:59.1038693Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:31:59.1039239Z module_map=module_map) 2025-05-07T20:31:59.1039605Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:31:59.1039960Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:31:59.1040214Z E ^ 2025-05-07T20:31:59.1040669Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:31:59.1041113Z 2025-05-07T20:31:59.1041523Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:31:59.1042024Z 2025-05-07T20:31:59.1042213Z Trying example: test_silu_mul_quant( 2025-05-07T20:31:59.1042620Z self=, 2025-05-07T20:31:59.1043023Z T=16384, 2025-05-07T20:31:59.1043216Z D=5120, 2025-05-07T20:31:59.1043409Z scale_ub=1200.0, 2025-05-07T20:31:59.1043645Z contiguous=False, 2025-05-07T20:31:59.1043873Z compiled=False, 2025-05-07T20:31:59.1044080Z ) 2025-05-07T20:31:59.1044390Z self = 2025-05-07T20:31:59.1044889Z T = 16384, D = 5120, scale_ub = 1200.0, contiguous = False, compiled = False 2025-05-07T20:31:59.1045162Z 2025-05-07T20:31:59.1045244Z @given( 2025-05-07T20:31:59.1045475Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:31:59.1045788Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:31:59.1046098Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:31:59.1046423Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:31:59.1046759Z compiled=st.sampled_from([True, False]), 2025-05-07T20:31:59.1047046Z ) 2025-05-07T20:31:59.1047396Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:31:59.1047828Z def test_silu_mul_quant( 2025-05-07T20:31:59.1048081Z self, 2025-05-07T20:31:59.1048279Z T: int, 2025-05-07T20:31:59.1048472Z D: int, 2025-05-07T20:31:59.1048696Z scale_ub: Optional[float], 2025-05-07T20:31:59.1048969Z contiguous: bool, 2025-05-07T20:31:59.1049200Z compiled: bool, 2025-05-07T20:31:59.1049424Z ) -> None: 2025-05-07T20:31:59.1049641Z torch.manual_seed(2025) 2025-05-07T20:31:59.1049879Z 2025-05-07T20:31:59.1050149Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:31:59.1050497Z 2025-05-07T20:31:59.1050689Z x_sign = torch.sign(x) 2025-05-07T20:31:59.1050982Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:31:59.1051291Z x = x_sign * x_clamp 2025-05-07T20:31:59.1051533Z x0 = x[:, :D] 2025-05-07T20:31:59.1051754Z x1 = x[:, D:] 2025-05-07T20:31:59.1051966Z 2025-05-07T20:31:59.1052148Z if contiguous: 2025-05-07T20:31:59.1052385Z x0 = x0.contiguous() 2025-05-07T20:31:59.1052645Z x1 = x1.contiguous() 2025-05-07T20:31:59.1052889Z 2025-05-07T20:31:59.1053079Z if scale_ub is not None: 2025-05-07T20:31:59.1053352Z scale_ub_tensor = torch.tensor( 2025-05-07T20:31:59.1053778Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:31:59.1054079Z ) 2025-05-07T20:31:59.1054273Z else: 2025-05-07T20:31:59.1054486Z scale_ub_tensor = None 2025-05-07T20:31:59.1054729Z 2025-05-07T20:31:59.1055001Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:31:59.1055324Z op = silu_mul_quant 2025-05-07T20:31:59.1055568Z if compiled: 2025-05-07T20:31:59.1055816Z op = torch.compile(op) 2025-05-07T20:31:59.1056116Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:31:59.1056388Z 2025-05-07T20:31:59.1056580Z > y_fp8, y_scale = fn() 2025-05-07T20:31:59.1056741Z 2025-05-07T20:31:59.1056846Z moe/activation_test.py:117: 2025-05-07T20:31:59.1057226Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:31:59.1057553Z moe/activation_test.py:115: in fn 2025-05-07T20:31:59.1057833Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:31:59.1058516Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 
2025-05-07T20:31:59.1059187Z _fbgemm_silu_mul_quant[grid]( 2025-05-07T20:31:59.1059742Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:31:59.1060417Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:31:59.1061179Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:31:59.1061713Z kernel = self.compile( 2025-05-07T20:31:59.1062242Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:31:59.1062898Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:31:59.1063295Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:31:59.1063520Z 2025-05-07T20:31:59.1063724Z self = 2025-05-07T20:31:59.1064807Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:31:59.1066185Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f93a43cc220>} 2025-05-07T20:31:59.1067506Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:31:59.1068517Z context = 2025-05-07T20:31:59.1068803Z 2025-05-07T20:31:59.1068967Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:31:59.1069483Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:31:59.1069951Z module_map=module_map) 2025-05-07T20:31:59.1070317Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:31:59.1070664Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:31:59.1070928Z E ^ 2025-05-07T20:31:59.1071392Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:31:59.1071830Z 2025-05-07T20:31:59.1072242Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:31:59.1072758Z 2025-05-07T20:31:59.1072865Z Trying example: test_silu_mul_quant( 2025-05-07T20:31:59.1073277Z self=, 2025-05-07T20:31:59.1073683Z T=16384, 2025-05-07T20:31:59.1073872Z D=5120, 2025-05-07T20:31:59.1074071Z scale_ub=1200.0, 2025-05-07T20:31:59.1074296Z contiguous=True, 2025-05-07T20:31:59.1074515Z compiled=True, 2025-05-07T20:31:59.1074722Z ) 2025-05-07T20:31:59.1075038Z self = 2025-05-07T20:31:59.1075520Z T = 16384, D = 5120, scale_ub = 1200.0, contiguous = True, compiled = True 2025-05-07T20:31:59.1075794Z 2025-05-07T20:31:59.1075873Z @given( 2025-05-07T20:31:59.1076111Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:31:59.1076423Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:31:59.1076722Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:31:59.1077145Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:31:59.1077473Z compiled=st.sampled_from([True, False]), 2025-05-07T20:31:59.1077754Z ) 2025-05-07T20:31:59.1078105Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:31:59.1078550Z def test_silu_mul_quant( 2025-05-07T20:31:59.1078788Z self, 2025-05-07T20:31:59.1078985Z T: int, 2025-05-07T20:31:59.1079185Z D: int, 2025-05-07T20:31:59.1079398Z scale_ub: Optional[float], 2025-05-07T20:31:59.1079670Z contiguous: bool, 2025-05-07T20:31:59.1079912Z compiled: bool, 2025-05-07T20:31:59.1080129Z ) -> None: 2025-05-07T20:31:59.1080350Z torch.manual_seed(2025) 2025-05-07T20:31:59.1080591Z 2025-05-07T20:31:59.1080943Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:31:59.1081285Z 2025-05-07T20:31:59.1081489Z x_sign = torch.sign(x) 2025-05-07T20:31:59.1081785Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:31:59.1082098Z x = x_sign * x_clamp 2025-05-07T20:31:59.1082343Z x0 = x[:, :D] 2025-05-07T20:31:59.1082564Z x1 = x[:, D:] 2025-05-07T20:31:59.1082767Z 2025-05-07T20:31:59.1082966Z if contiguous: 2025-05-07T20:31:59.1083198Z x0 = x0.contiguous() 2025-05-07T20:31:59.1083455Z x1 = x1.contiguous() 2025-05-07T20:31:59.1083706Z 2025-05-07T20:31:59.1083904Z if scale_ub is not None: 2025-05-07T20:31:59.1084176Z scale_ub_tensor = torch.tensor( 2025-05-07T20:31:59.1084512Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:31:59.1084823Z ) 2025-05-07T20:31:59.1085018Z else: 2025-05-07T20:31:59.1085241Z scale_ub_tensor = None 2025-05-07T20:31:59.1085496Z 2025-05-07T20:31:59.1085728Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:31:59.1086047Z op = silu_mul_quant 2025-05-07T20:31:59.1086308Z if compiled: 2025-05-07T20:31:59.1086559Z op = torch.compile(op) 2025-05-07T20:31:59.1086851Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:31:59.1087129Z 2025-05-07T20:31:59.1087327Z > y_fp8, y_scale = fn() 2025-05-07T20:31:59.1087489Z 2025-05-07T20:31:59.1087588Z moe/activation_test.py:117: 2025-05-07T20:31:59.1087888Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:31:59.1088226Z moe/activation_test.py:115: in fn 2025-05-07T20:31:59.1088508Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:31:59.1089063Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/torch/_dynamo/eval_frame.py:678: in _fn 2025-05-07T20:31:59.1089625Z return fn(*args, **kwargs) 
2025-05-07T20:31:59.1090278Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:31:59.1090954Z _fbgemm_silu_mul_quant[grid]( 2025-05-07T20:31:59.1091490Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:31:59.1092164Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:31:59.1092812Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:31:59.1093340Z kernel = self.compile( 2025-05-07T20:31:59.1093970Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:31:59.1094621Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:31:59.1095015Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:31:59.1095247Z 2025-05-07T20:31:59.1095453Z self = 2025-05-07T20:31:59.1096515Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:31:59.1097946Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f93a43cd4e0>} 2025-05-07T20:31:59.1099566Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:31:59.1100565Z context = 2025-05-07T20:31:59.1100853Z 2025-05-07T20:31:59.1101158Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:31:59.1101674Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:31:59.1102142Z module_map=module_map) 2025-05-07T20:31:59.1102500Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:31:59.1102855Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:31:59.1103114Z E ^ 2025-05-07T20:31:59.1103563Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:31:59.1104009Z 2025-05-07T20:31:59.1104414Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:31:59.2652582Z 2025-05-07T20:31:59.2653372Z Trying example: test_silu_mul_quant( 2025-05-07T20:31:59.2654387Z self=, 2025-05-07T20:31:59.2654929Z T=16384, 2025-05-07T20:31:59.2655171Z D=5120, 2025-05-07T20:31:59.2655375Z scale_ub=None, 2025-05-07T20:31:59.2655597Z contiguous=False, 2025-05-07T20:31:59.2655828Z compiled=True, 2025-05-07T20:31:59.2656045Z ) 2025-05-07T20:31:59.2656372Z self = 2025-05-07T20:31:59.2656867Z T = 16384, D = 5120, scale_ub = None, contiguous = False, compiled = True 2025-05-07T20:31:59.2657140Z 2025-05-07T20:31:59.2657220Z @given( 2025-05-07T20:31:59.2657460Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:31:59.2657780Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:31:59.2658080Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:31:59.2658409Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:31:59.2658739Z compiled=st.sampled_from([True, False]), 2025-05-07T20:31:59.2659026Z ) 2025-05-07T20:31:59.2659372Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:31:59.2659810Z def test_silu_mul_quant( 2025-05-07T20:31:59.2660055Z self, 2025-05-07T20:31:59.2660245Z T: int, 2025-05-07T20:31:59.2660446Z D: int, 2025-05-07T20:31:59.2660668Z scale_ub: Optional[float], 2025-05-07T20:31:59.2660935Z contiguous: bool, 2025-05-07T20:31:59.2661173Z compiled: bool, 2025-05-07T20:31:59.2661399Z ) -> None: 2025-05-07T20:31:59.2661612Z torch.manual_seed(2025) 2025-05-07T20:31:59.2661856Z 2025-05-07T20:31:59.2662131Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:31:59.2662467Z 2025-05-07T20:31:59.2662665Z x_sign = torch.sign(x) 2025-05-07T20:31:59.2662955Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:31:59.2663260Z x = x_sign * x_clamp 2025-05-07T20:31:59.2663505Z x0 = x[:, :D] 2025-05-07T20:31:59.2663722Z x1 = x[:, D:] 2025-05-07T20:31:59.2663928Z 2025-05-07T20:31:59.2664123Z if contiguous: 2025-05-07T20:31:59.2664356Z x0 = x0.contiguous() 2025-05-07T20:31:59.2664616Z x1 = x1.contiguous() 2025-05-07T20:31:59.2664852Z 2025-05-07T20:31:59.2665363Z if scale_ub is not None: 2025-05-07T20:31:59.2665654Z scale_ub_tensor = torch.tensor( 2025-05-07T20:31:59.2665983Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:31:59.2666292Z ) 2025-05-07T20:31:59.2666494Z else: 2025-05-07T20:31:59.2666702Z scale_ub_tensor = None 2025-05-07T20:31:59.2666955Z 2025-05-07T20:31:59.2667190Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:31:59.2667498Z op = silu_mul_quant 2025-05-07T20:31:59.2667747Z if compiled: 2025-05-07T20:31:59.2667999Z op = torch.compile(op) 2025-05-07T20:31:59.2668302Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:31:59.2668579Z 2025-05-07T20:31:59.2668902Z > y_fp8, y_scale = fn() 2025-05-07T20:31:59.2669075Z 2025-05-07T20:31:59.2669174Z moe/activation_test.py:117: 2025-05-07T20:31:59.2669470Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:31:59.2669799Z moe/activation_test.py:115: in fn 2025-05-07T20:31:59.2670083Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:31:59.2670639Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/torch/_dynamo/eval_frame.py:678: in _fn 2025-05-07T20:31:59.2678666Z return fn(*args, **kwargs) 
2025-05-07T20:31:59.2679346Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant
2025-05-07T20:31:59.2680031Z     _fbgemm_silu_mul_quant[grid](
2025-05-07T20:31:59.2680571Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/jit.py:330: in <lambda>
2025-05-07T20:31:59.2681262Z     return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs)
2025-05-07T20:31:59.2681921Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/jit.py:623: in run
2025-05-07T20:31:59.2682450Z     kernel = self.compile(
2025-05-07T20:31:59.2683003Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:273: in compile
2025-05-07T20:31:59.2683662Z     module = src.make_ir(options, codegen_fns, module_map, context)
2025-05-07T20:31:59.2684057Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _
2025-05-07T20:31:59.2684297Z
2025-05-07T20:31:59.2684505Z self =
2025-05-07T20:31:59.2685625Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True)
2025-05-07T20:31:59.2686988Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f93a43ce2a0>}
2025-05-07T20:31:59.2688790Z module_map = {'triton.language.extra.libdevice': }
2025-05-07T20:31:59.2689795Z context =
2025-05-07T20:31:59.2690085Z
2025-05-07T20:31:59.2690249Z     def make_ir(self, options, codegen_fns, module_map, context):
2025-05-07T20:31:59.2690763Z >       return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns,
2025-05-07T20:31:59.2691224Z                            module_map=module_map)
2025-05-07T20:31:59.2691580Z E       triton.compiler.errors.CompilationError: at 1:0:
2025-05-07T20:31:59.2691934Z E       def _fbgemm_silu_mul_quant(
2025-05-07T20:31:59.2692195Z E       ^
2025-05-07T20:31:59.2692648Z E       ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')")
2025-05-07T20:31:59.2693093Z
2025-05-07T20:31:59.2693499Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:100: CompilationError
2025-05-07T20:31:59.2694250Z
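The failure above is an architecture limitation rather than a bug in the kernel or the test: fp8e4nv is Triton's name for the float8_e4m3fn format, and Triton's NVIDIA backend only compiles it for GPUs of compute capability 8.9 or newer (Ada and Hopper). The A10G on this linux.g5.4xlarge runner reports capability 8.6, where only the fp8e4b15 and fp8e5 encodings are available, hence the ValueError. Below is a minimal sketch of a capability guard that would let such tests skip cleanly on pre-Ada runners; the helper name and decorator placement are illustrative assumptions, not FBGEMM code.

```python
import unittest

import torch


def device_supports_fp8_e4m3() -> bool:
    # Triton lowers torch.float8_e4m3fn to its "fp8e4nv" type, which the
    # NVIDIA backend accepts only on compute capability >= (8, 9).
    # The A10G (sm_86) driving this job therefore cannot compile the kernel.
    return torch.cuda.is_available() and torch.cuda.get_device_capability() >= (8, 9)


class SiluMulQuantGuardExample(unittest.TestCase):
    @unittest.skipUnless(device_supports_fp8_e4m3(), "FP8 e4m3 requires SM 8.9+ (Ada/Hopper)")
    def test_silu_mul_quant_guarded(self) -> None:
        pass  # the real test body from moe/activation_test.py would run here
```

With a guard like this, the job would report skips instead of the run of identical CompilationErrors that follows.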
[log condensed: Hypothesis went on to try ten more examples of test_silu_mul_quant, and every one failed with the identical CompilationError at triton/compiler/compiler.py:100, i.e. ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')"), with a source listing and traceback matching the example above (minus the torch/_dynamo/eval_frame.py frame for the compiled=False examples). The parameter sets tried were:
  T=2048,  D=5120, scale_ub=None,   contiguous=False, compiled=True
  T=2048,  D=5120, scale_ub=1200.0, contiguous=False, compiled=True
  T=4096,  D=5120, scale_ub=1200.0, contiguous=True,  compiled=True
  T=128,   D=5120, scale_ub=1200.0, contiguous=False, compiled=True
  T=16384, D=7168, scale_ub=1200.0, contiguous=True,  compiled=True
  T=16384, D=5120, scale_ub=1200.0, contiguous=True,  compiled=False
  T=1,     D=7168, scale_ub=1200.0, contiguous=False, compiled=False
  T=4096,  D=7168, scale_ub=1200.0, contiguous=False, compiled=True
  T=128,   D=7168, scale_ub=1200.0, contiguous=False, compiled=True
  T=2048,  D=7168, scale_ub=None,   contiguous=True,  compiled=True]
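For context on what these examples exercise: the test body splits one [T, 2*D] bfloat16 input into halves x0 and x1 and expects silu_mul_quant(x0, x1, scale_ub_tensor) to return a quantized tensor plus scales. The sketch below is a rough reference of the presumed semantics, SiLU gating followed by rowwise FP8 quantization with an optional upper bound on the row maximum; every detail here, including the scale convention, is inferred from the test rather than taken from FBGEMM's implementation.

```python
import torch

FP8_MAX = torch.finfo(torch.float8_e4m3fn).max  # 448.0 for e4m3fn


def silu_mul_quant_ref(
    x0: torch.Tensor,                      # [T, D], bfloat16
    x1: torch.Tensor,                      # [T, D], bfloat16
    scale_ub: torch.Tensor | None = None,  # optional [1] float32 clamp
) -> tuple[torch.Tensor, torch.Tensor]:
    # Compute y = SiLU(x0) * x1 in float32, then quantize each row to fp8
    # with a per-row dequantization scale.
    y = torch.nn.functional.silu(x0.float()) * x1.float()
    row_max = y.abs().amax(dim=1, keepdim=True).clamp(min=1e-12)
    if scale_ub is not None:
        row_max = torch.minimum(row_max, scale_ub)
    y_scale = row_max / FP8_MAX
    y_fp8 = (y / y_scale).to(torch.float8_e4m3fn)
    return y_fp8, y_scale.squeeze(1)
```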
2025-05-07T20:32:00.0274550Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:32:00.0275236Z _fbgemm_silu_mul_quant[grid]( 2025-05-07T20:32:00.0275775Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:00.0276444Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:00.0277115Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:00.0277649Z kernel = self.compile( 2025-05-07T20:32:00.0278190Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:00.0278844Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:00.0279244Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:00.0279474Z 2025-05-07T20:32:00.0279686Z self = 2025-05-07T20:32:00.0280755Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:00.0282120Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f93975cd440>} 2025-05-07T20:32:00.0283447Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:00.0284543Z context = 2025-05-07T20:32:00.0284825Z 2025-05-07T20:32:00.0284999Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:00.0285513Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:00.0285985Z module_map=module_map) 2025-05-07T20:32:00.0286356Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:00.0286714Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:32:00.0286973Z E ^ 2025-05-07T20:32:00.0287514Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:00.0287961Z 2025-05-07T20:32:00.0288378Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:00.0288891Z 2025-05-07T20:32:00.0288996Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:00.0289410Z self=, 2025-05-07T20:32:00.0289816Z T=16384, 2025-05-07T20:32:00.0290017Z D=5120, 2025-05-07T20:32:00.0290211Z scale_ub=None, 2025-05-07T20:32:00.0290434Z contiguous=False, 2025-05-07T20:32:00.0290673Z compiled=False, 2025-05-07T20:32:00.0290882Z ) 2025-05-07T20:32:00.0291203Z self = 2025-05-07T20:32:00.0291704Z T = 16384, D = 5120, scale_ub = None, contiguous = False, compiled = False 2025-05-07T20:32:00.0291979Z 2025-05-07T20:32:00.0292068Z @given( 2025-05-07T20:32:00.0292304Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:00.0292620Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:00.0292924Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:00.0293263Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:00.0293592Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:00.0294000Z ) 2025-05-07T20:32:00.0294345Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:00.0294788Z def test_silu_mul_quant( 2025-05-07T20:32:00.0295034Z self, 2025-05-07T20:32:00.0295228Z T: int, 2025-05-07T20:32:00.0295430Z D: int, 2025-05-07T20:32:00.0295652Z scale_ub: Optional[float], 2025-05-07T20:32:00.0295921Z contiguous: bool, 2025-05-07T20:32:00.0296166Z compiled: bool, 2025-05-07T20:32:00.0296395Z ) -> None: 2025-05-07T20:32:00.0296607Z torch.manual_seed(2025) 2025-05-07T20:32:00.0296859Z 2025-05-07T20:32:00.0297134Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:00.0297473Z 2025-05-07T20:32:00.0297672Z x_sign = torch.sign(x) 2025-05-07T20:32:00.0297969Z > x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:00.0300193Z E torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 320.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 144.44 MiB is free. Including non-PyTorch memory, this process has 21.92 GiB memory in use. Of the allocated memory 21.60 GiB is allocated by PyTorch, and 40.52 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. 
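The CompilationError that keeps recurring above is architectural rather than a flaw in the test: Triton's fp8e4nv dtype is FP8 E4M3 for NVIDIA GPUs with compute capability 8.9 or newer (Ada and Hopper), and the GPU on this runner evidently reports a lower capability, so the backend exposes only fp8e4b15 and fp8e5 and rejects every compiled variant of _fbgemm_silu_mul_quant at make_ir time. A minimal capability guard along the following lines could turn these repeated failures into a single skip; the helper is illustrative and not part of the test file:

    import torch

    def supports_fp8e4nv() -> bool:
        """True when Triton can compile fp8e4nv (FP8 E4M3) kernels here."""
        if not torch.cuda.is_available():
            return False
        # fp8e4nv needs SM 8.9+ (Ada/Hopper); older parts only get
        # fp8e4b15 and fp8e5, exactly as the ValueError above reports.
        return torch.cuda.get_device_capability() >= (8, 9)

Wired in as unittest.skipUnless(supports_fp8e4nv(), "FP8 E4M3 not supported on this GPU"), the Hypothesis examples would be skipped once per test instead of failing for every drawn parameter set.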
See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables) 2025-05-07T20:32:00.0302025Z 2025-05-07T20:32:00.0302142Z moe/activation_test.py:95: OutOfMemoryError 2025-05-07T20:32:00.0302357Z 2025-05-07T20:32:00.0302466Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:00.0302864Z self=, 2025-05-07T20:32:00.0303267Z T=4096, 2025-05-07T20:32:00.0303593Z D=7168, 2025-05-07T20:32:00.0303782Z scale_ub=1200.0, 2025-05-07T20:32:00.0304007Z contiguous=True, 2025-05-07T20:32:00.0304226Z compiled=True, 2025-05-07T20:32:00.0304424Z ) 2025-05-07T20:32:00.0304743Z self = 2025-05-07T20:32:00.0305233Z T = 4096, D = 7168, scale_ub = 1200.0, contiguous = True, compiled = True 2025-05-07T20:32:00.0305500Z 2025-05-07T20:32:00.0305588Z @given( 2025-05-07T20:32:00.0305814Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:00.0306125Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:00.0306431Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:00.0306900Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:00.0307232Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:00.0307523Z ) 2025-05-07T20:32:00.0307862Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:00.0308311Z def test_silu_mul_quant( 2025-05-07T20:32:00.0308549Z self, 2025-05-07T20:32:00.0308735Z T: int, 2025-05-07T20:32:00.0308934Z D: int, 2025-05-07T20:32:00.0309154Z scale_ub: Optional[float], 2025-05-07T20:32:00.0309421Z contiguous: bool, 2025-05-07T20:32:00.0309655Z compiled: bool, 2025-05-07T20:32:00.0309878Z ) -> None: 2025-05-07T20:32:00.0310092Z torch.manual_seed(2025) 2025-05-07T20:32:00.0310322Z 2025-05-07T20:32:00.0310589Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:00.0310923Z 2025-05-07T20:32:00.0311111Z x_sign = torch.sign(x) 2025-05-07T20:32:00.0311404Z > x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:00.0313355Z E torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 112.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 32.44 MiB is free. Including non-PyTorch memory, this process has 22.03 GiB memory in use. Of the allocated memory 21.61 GiB is allocated by PyTorch, and 136.52 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. 
See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables) 2025-05-07T20:32:00.0315170Z 2025-05-07T20:32:00.0315293Z moe/activation_test.py:95: OutOfMemoryError 2025-05-07T20:32:00.0315499Z 2025-05-07T20:32:00.0315608Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:00.0316020Z self=, 2025-05-07T20:32:00.0316416Z T=16384, 2025-05-07T20:32:00.0316606Z D=7168, 2025-05-07T20:32:00.0316799Z scale_ub=None, 2025-05-07T20:32:00.0317013Z contiguous=False, 2025-05-07T20:32:00.0317237Z compiled=False, 2025-05-07T20:32:00.0317431Z ) 2025-05-07T20:32:00.0317745Z self = 2025-05-07T20:32:00.0318236Z T = 16384, D = 7168, scale_ub = None, contiguous = False, compiled = False 2025-05-07T20:32:00.0318505Z 2025-05-07T20:32:00.0318590Z @given( 2025-05-07T20:32:00.0318812Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:00.0319122Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:00.0319427Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:00.0319743Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:00.0320068Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:00.0320352Z ) 2025-05-07T20:32:00.0320689Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:00.0321130Z def test_silu_mul_quant( 2025-05-07T20:32:00.0321372Z self, 2025-05-07T20:32:00.0321567Z T: int, 2025-05-07T20:32:00.0321768Z D: int, 2025-05-07T20:32:00.0321985Z scale_ub: Optional[float], 2025-05-07T20:32:00.0322341Z contiguous: bool, 2025-05-07T20:32:00.0322580Z compiled: bool, 2025-05-07T20:32:00.0322804Z ) -> None: 2025-05-07T20:32:00.0323020Z torch.manual_seed(2025) 2025-05-07T20:32:00.0323254Z 2025-05-07T20:32:00.0323520Z > x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:00.0325591Z E torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 448.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 144.44 MiB is free. Including non-PyTorch memory, this process has 21.92 GiB memory in use. Of the allocated memory 21.50 GiB is allocated by PyTorch, and 136.52 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. 
See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables) 2025-05-07T20:32:00.0327405Z 2025-05-07T20:32:00.0327528Z moe/activation_test.py:92: OutOfMemoryError 2025-05-07T20:32:00.1563453Z 2025-05-07T20:32:00.1564121Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:00.1564755Z self=, 2025-05-07T20:32:00.1565769Z T=2048, 2025-05-07T20:32:00.1566233Z D=7168, 2025-05-07T20:32:00.1566615Z scale_ub=1200.0, 2025-05-07T20:32:00.1567059Z contiguous=True, 2025-05-07T20:32:00.1567481Z compiled=True, 2025-05-07T20:32:00.1567893Z ) 2025-05-07T20:32:00.1568533Z self = 2025-05-07T20:32:00.1569517Z T = 2048, D = 7168, scale_ub = 1200.0, contiguous = True, compiled = True 2025-05-07T20:32:00.1570066Z 2025-05-07T20:32:00.1570227Z @given( 2025-05-07T20:32:00.1570725Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:00.1571345Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:00.1571939Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:00.1572604Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:00.1573254Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:00.1574005Z ) 2025-05-07T20:32:00.1574692Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:00.1575214Z def test_silu_mul_quant( 2025-05-07T20:32:00.1575446Z self, 2025-05-07T20:32:00.1575648Z T: int, 2025-05-07T20:32:00.1575851Z D: int, 2025-05-07T20:32:00.1576078Z scale_ub: Optional[float], 2025-05-07T20:32:00.1576353Z contiguous: bool, 2025-05-07T20:32:00.1576592Z compiled: bool, 2025-05-07T20:32:00.1576814Z ) -> None: 2025-05-07T20:32:00.1577033Z torch.manual_seed(2025) 2025-05-07T20:32:00.1577275Z 2025-05-07T20:32:00.1577548Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:00.1577887Z 2025-05-07T20:32:00.1578085Z x_sign = torch.sign(x) 2025-05-07T20:32:00.1578374Z > x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:00.1580346Z E torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 56.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 32.44 MiB is free. Including non-PyTorch memory, this process has 22.03 GiB memory in use. Of the allocated memory 21.67 GiB is allocated by PyTorch, and 80.52 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. 
See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables) 2025-05-07T20:32:00.1582188Z 2025-05-07T20:32:00.1582315Z moe/activation_test.py:95: OutOfMemoryError 2025-05-07T20:32:00.1582524Z 2025-05-07T20:32:00.1582629Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:00.1583035Z self=, 2025-05-07T20:32:00.1583464Z T=2048, 2025-05-07T20:32:00.1583655Z D=7168, 2025-05-07T20:32:00.1584227Z scale_ub=None, 2025-05-07T20:32:00.1584439Z contiguous=True, 2025-05-07T20:32:00.1584659Z compiled=False, 2025-05-07T20:32:00.1584862Z ) 2025-05-07T20:32:00.1585174Z self = 2025-05-07T20:32:00.1585658Z T = 2048, D = 7168, scale_ub = None, contiguous = True, compiled = False 2025-05-07T20:32:00.1585923Z 2025-05-07T20:32:00.1586007Z @given( 2025-05-07T20:32:00.1586234Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:00.1586548Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:00.1586850Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:00.1587178Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:00.1587654Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:00.1587944Z ) 2025-05-07T20:32:00.1588292Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:00.1588735Z def test_silu_mul_quant( 2025-05-07T20:32:00.1588981Z self, 2025-05-07T20:32:00.1589181Z T: int, 2025-05-07T20:32:00.1589373Z D: int, 2025-05-07T20:32:00.1589591Z scale_ub: Optional[float], 2025-05-07T20:32:00.1589864Z contiguous: bool, 2025-05-07T20:32:00.1590101Z compiled: bool, 2025-05-07T20:32:00.1590324Z ) -> None: 2025-05-07T20:32:00.1590543Z torch.manual_seed(2025) 2025-05-07T20:32:00.1590780Z 2025-05-07T20:32:00.1591047Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:00.1591388Z 2025-05-07T20:32:00.1591585Z > x_sign = torch.sign(x) 2025-05-07T20:32:00.1593479Z E torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 56.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 32.44 MiB is free. Including non-PyTorch memory, this process has 22.03 GiB memory in use. Of the allocated memory 21.67 GiB is allocated by PyTorch, and 80.52 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. 
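The OutOfMemoryError cases are a separate issue: each drawn example allocates a fresh [T, 2 * D] bfloat16 input (448.00 MiB at T=16384, D=7168) plus several derived tensors, and by this point the allocator already reports roughly 21.9 GiB of the device's 22.07 GiB in use, so even 40-56 MiB requests fail. The error text's own suggestion is the cheapest mitigation to try; it must take effect before the process first touches CUDA, for example from the CI job environment or a conftest.py (the placement is an assumption, the variable itself is quoted from the log):

    import os

    # Reduce fragmentation in the caching allocator. This does not add
    # memory; it lets the allocator grow segments instead of stranding
    # "reserved but unallocated" blocks like those reported above.
    os.environ.setdefault("PYTORCH_CUDA_ALLOC_CONF", "expandable_segments:True")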
See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables) 2025-05-07T20:32:00.1595288Z 2025-05-07T20:32:00.1595403Z moe/activation_test.py:94: OutOfMemoryError 2025-05-07T20:32:00.1595620Z 2025-05-07T20:32:00.1595721Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:00.1596128Z self=, 2025-05-07T20:32:00.1596520Z T=1, 2025-05-07T20:32:00.1596706Z D=7168, 2025-05-07T20:32:00.1596900Z scale_ub=1200.0, 2025-05-07T20:32:00.1597121Z contiguous=True, 2025-05-07T20:32:00.1597343Z compiled=False, 2025-05-07T20:32:00.1597550Z ) 2025-05-07T20:32:00.1597863Z self = 2025-05-07T20:32:00.1598676Z T = 1, D = 7168, scale_ub = 1200.0, contiguous = True, compiled = False 2025-05-07T20:32:00.1598948Z 2025-05-07T20:32:00.1599028Z @given( 2025-05-07T20:32:00.1599257Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:00.1599565Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:00.1599868Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:00.1600193Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:00.1600511Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:00.1600797Z ) 2025-05-07T20:32:00.1601141Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:00.1601585Z def test_silu_mul_quant( 2025-05-07T20:32:00.1601821Z self, 2025-05-07T20:32:00.1602016Z T: int, 2025-05-07T20:32:00.1602215Z D: int, 2025-05-07T20:32:00.1602426Z scale_ub: Optional[float], 2025-05-07T20:32:00.1602691Z contiguous: bool, 2025-05-07T20:32:00.1602928Z compiled: bool, 2025-05-07T20:32:00.1603144Z ) -> None: 2025-05-07T20:32:00.1603500Z torch.manual_seed(2025) 2025-05-07T20:32:00.1603740Z 2025-05-07T20:32:00.1604001Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:00.1604344Z 2025-05-07T20:32:00.1604536Z x_sign = torch.sign(x) 2025-05-07T20:32:00.1604819Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:00.1605132Z x = x_sign * x_clamp 2025-05-07T20:32:00.1605373Z x0 = x[:, :D] 2025-05-07T20:32:00.1605585Z x1 = x[:, D:] 2025-05-07T20:32:00.1605797Z 2025-05-07T20:32:00.1605986Z if contiguous: 2025-05-07T20:32:00.1606213Z x0 = x0.contiguous() 2025-05-07T20:32:00.1606471Z x1 = x1.contiguous() 2025-05-07T20:32:00.1606711Z 2025-05-07T20:32:00.1607023Z if scale_ub is not None: 2025-05-07T20:32:00.1607294Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:00.1607625Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:00.1607938Z ) 2025-05-07T20:32:00.1608131Z else: 2025-05-07T20:32:00.1608345Z scale_ub_tensor = None 2025-05-07T20:32:00.1608598Z 2025-05-07T20:32:00.1608826Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:00.1609142Z op = silu_mul_quant 2025-05-07T20:32:00.1609391Z if compiled: 2025-05-07T20:32:00.1609636Z op = torch.compile(op) 2025-05-07T20:32:00.1609936Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:00.1610210Z 2025-05-07T20:32:00.1610396Z > y_fp8, y_scale = fn() 2025-05-07T20:32:00.1610564Z 2025-05-07T20:32:00.1610664Z moe/activation_test.py:117: 2025-05-07T20:32:00.1610964Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:00.1611292Z moe/activation_test.py:115: in fn 2025-05-07T20:32:00.1611566Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:00.1612253Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:32:00.1612940Z 
_fbgemm_silu_mul_quant[grid]( 2025-05-07T20:32:00.1613464Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:00.1614263Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:00.1614918Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:00.1615448Z kernel = self.compile( 2025-05-07T20:32:00.1615977Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:00.1616630Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:00.1617028Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:00.1617253Z 2025-05-07T20:32:00.1617462Z self = 2025-05-07T20:32:00.1618522Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:00.1619866Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f93973f8400>} 2025-05-07T20:32:00.1621179Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:00.1622191Z context = 2025-05-07T20:32:00.1622473Z 2025-05-07T20:32:00.1622639Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:00.1623264Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:00.1623724Z module_map=module_map) 2025-05-07T20:32:00.1624086Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:00.1624428Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:32:00.1624688Z E ^ 2025-05-07T20:32:00.1625145Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:00.1625584Z 2025-05-07T20:32:00.1625988Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:00.1626501Z 2025-05-07T20:32:00.1626683Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:00.1627104Z self=, 2025-05-07T20:32:00.1627503Z T=128, 2025-05-07T20:32:00.1627696Z D=5120, 2025-05-07T20:32:00.1627894Z scale_ub=None, 2025-05-07T20:32:00.1628101Z contiguous=True, 2025-05-07T20:32:00.1628326Z compiled=False, 2025-05-07T20:32:00.1628530Z ) 2025-05-07T20:32:00.1628843Z self = 2025-05-07T20:32:00.1629323Z T = 128, D = 5120, scale_ub = None, contiguous = True, compiled = False 2025-05-07T20:32:00.1629586Z 2025-05-07T20:32:00.1629677Z @given( 2025-05-07T20:32:00.1629906Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:00.1639799Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:00.1640131Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:00.1640457Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:00.1640799Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:00.1641090Z ) 2025-05-07T20:32:00.1641438Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:00.1641887Z def test_silu_mul_quant( 2025-05-07T20:32:00.1642140Z self, 2025-05-07T20:32:00.1642333Z T: int, 2025-05-07T20:32:00.1642534Z D: int, 2025-05-07T20:32:00.1642758Z scale_ub: Optional[float], 2025-05-07T20:32:00.1643027Z contiguous: bool, 2025-05-07T20:32:00.1643270Z compiled: bool, 2025-05-07T20:32:00.1643498Z ) -> None: 2025-05-07T20:32:00.1643710Z torch.manual_seed(2025) 2025-05-07T20:32:00.1643954Z 2025-05-07T20:32:00.1644233Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:00.1644579Z 2025-05-07T20:32:00.1644768Z x_sign = torch.sign(x) 2025-05-07T20:32:00.1645060Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:00.1645379Z x = x_sign * x_clamp 2025-05-07T20:32:00.1645617Z x0 = x[:, :D] 2025-05-07T20:32:00.1645838Z x1 = x[:, D:] 2025-05-07T20:32:00.1646050Z 2025-05-07T20:32:00.1646229Z if contiguous: 2025-05-07T20:32:00.1646459Z x0 = x0.contiguous() 2025-05-07T20:32:00.1646719Z x1 = x1.contiguous() 2025-05-07T20:32:00.1646950Z 2025-05-07T20:32:00.1647140Z if scale_ub is not None: 2025-05-07T20:32:00.1647402Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:00.1647723Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:00.1648031Z ) 2025-05-07T20:32:00.1648225Z else: 2025-05-07T20:32:00.1648428Z scale_ub_tensor = None 2025-05-07T20:32:00.1648681Z 2025-05-07T20:32:00.1648915Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:00.1649218Z op = silu_mul_quant 2025-05-07T20:32:00.1649466Z if compiled: 2025-05-07T20:32:00.1649709Z op = torch.compile(op) 2025-05-07T20:32:00.1650005Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:00.1650270Z 2025-05-07T20:32:00.1650454Z > y_fp8, y_scale = fn() 2025-05-07T20:32:00.1650612Z 2025-05-07T20:32:00.1650709Z moe/activation_test.py:117: 2025-05-07T20:32:00.1651119Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:00.1651449Z moe/activation_test.py:115: in fn 2025-05-07T20:32:00.1651721Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:00.1652399Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:32:00.1653075Z 
_fbgemm_silu_mul_quant[grid]( 2025-05-07T20:32:00.1653596Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:00.1654348Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:00.1655083Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:00.1655605Z kernel = self.compile( 2025-05-07T20:32:00.1656130Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:00.1656788Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:00.1657184Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:00.1657408Z 2025-05-07T20:32:00.1657621Z self = 2025-05-07T20:32:00.1658677Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:00.1660026Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f93973f9300>} 2025-05-07T20:32:00.1661344Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:00.1662349Z context = 2025-05-07T20:32:00.1662626Z 2025-05-07T20:32:00.1662789Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:00.1663297Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:00.1663752Z module_map=module_map) 2025-05-07T20:32:00.1664110Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:00.1664449Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:32:00.1664703Z E ^ 2025-05-07T20:32:00.1665159Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:00.1665593Z 2025-05-07T20:32:00.1666000Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:00.2785477Z 2025-05-07T20:32:00.2786224Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:00.2786880Z self=, 2025-05-07T20:32:00.2787417Z T=128, 2025-05-07T20:32:00.2787605Z D=7168, 2025-05-07T20:32:00.2787800Z scale_ub=None, 2025-05-07T20:32:00.2788014Z contiguous=True, 2025-05-07T20:32:00.2788238Z compiled=False, 2025-05-07T20:32:00.2788438Z ) 2025-05-07T20:32:00.2788756Z self = 2025-05-07T20:32:00.2789243Z T = 128, D = 7168, scale_ub = None, contiguous = True, compiled = False 2025-05-07T20:32:00.2789505Z 2025-05-07T20:32:00.2789583Z @given( 2025-05-07T20:32:00.2789831Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:00.2790143Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:00.2790440Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:00.2791153Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:00.2791478Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:00.2791754Z ) 2025-05-07T20:32:00.2792099Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:00.2792540Z def test_silu_mul_quant( 2025-05-07T20:32:00.2792786Z self, 2025-05-07T20:32:00.2792980Z T: int, 2025-05-07T20:32:00.2793181Z D: int, 2025-05-07T20:32:00.2793401Z scale_ub: Optional[float], 2025-05-07T20:32:00.2793667Z contiguous: bool, 2025-05-07T20:32:00.2793905Z compiled: bool, 2025-05-07T20:32:00.2794136Z ) -> None: 2025-05-07T20:32:00.2794345Z torch.manual_seed(2025) 2025-05-07T20:32:00.2794586Z 2025-05-07T20:32:00.2795016Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:00.2795355Z 2025-05-07T20:32:00.2795549Z x_sign = torch.sign(x) 2025-05-07T20:32:00.2795839Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:00.2796149Z x = x_sign * x_clamp 2025-05-07T20:32:00.2796390Z x0 = x[:, :D] 2025-05-07T20:32:00.2796613Z x1 = x[:, D:] 2025-05-07T20:32:00.2796814Z 2025-05-07T20:32:00.2796999Z if contiguous: 2025-05-07T20:32:00.2797224Z x0 = x0.contiguous() 2025-05-07T20:32:00.2797479Z x1 = x1.contiguous() 2025-05-07T20:32:00.2797716Z 2025-05-07T20:32:00.2797908Z if scale_ub is not None: 2025-05-07T20:32:00.2798467Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:00.2798796Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:00.2799104Z ) 2025-05-07T20:32:00.2799295Z else: 2025-05-07T20:32:00.2799502Z scale_ub_tensor = None 2025-05-07T20:32:00.2799755Z 2025-05-07T20:32:00.2799982Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:00.2800290Z op = silu_mul_quant 2025-05-07T20:32:00.2800545Z if compiled: 2025-05-07T20:32:00.2800796Z op = torch.compile(op) 2025-05-07T20:32:00.2801089Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:00.2801367Z 2025-05-07T20:32:00.2801562Z > y_fp8, y_scale = fn() 2025-05-07T20:32:00.2801723Z 2025-05-07T20:32:00.2801826Z moe/activation_test.py:117: 2025-05-07T20:32:00.2802116Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:00.2802447Z moe/activation_test.py:115: in fn 2025-05-07T20:32:00.2802726Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:00.2803401Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:32:00.2804085Z 
_fbgemm_silu_mul_quant[grid]( 2025-05-07T20:32:00.2804619Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:00.2805345Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:00.2805997Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:00.2806524Z kernel = self.compile( 2025-05-07T20:32:00.2807060Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:00.2807705Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:00.2808099Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:00.2808330Z 2025-05-07T20:32:00.2808536Z self = 2025-05-07T20:32:00.2809608Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:00.2811216Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f93973fa0c0>} 2025-05-07T20:32:00.2812529Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:00.2813538Z context = 2025-05-07T20:32:00.2813956Z 2025-05-07T20:32:00.2814120Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:00.2814756Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:00.2815214Z module_map=module_map) 2025-05-07T20:32:00.2815579Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:00.2815930Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:32:00.2816197Z E ^ 2025-05-07T20:32:00.2816649Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:00.2817095Z 2025-05-07T20:32:00.2817501Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:00.2818004Z 2025-05-07T20:32:00.2818115Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:00.2818516Z self=, 2025-05-07T20:32:00.2818912Z T=2048, 2025-05-07T20:32:00.2819100Z D=7168, 2025-05-07T20:32:00.2819297Z scale_ub=1200.0, 2025-05-07T20:32:00.2819511Z contiguous=True, 2025-05-07T20:32:00.2819739Z compiled=False, 2025-05-07T20:32:00.2819949Z ) 2025-05-07T20:32:00.2820263Z self = 2025-05-07T20:32:00.2820756Z T = 2048, D = 7168, scale_ub = 1200.0, contiguous = True, compiled = False 2025-05-07T20:32:00.2821030Z 2025-05-07T20:32:00.2821114Z @given( 2025-05-07T20:32:00.2821341Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:00.2821651Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:00.2821957Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:00.2822278Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:00.2822601Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:00.2822887Z ) 2025-05-07T20:32:00.2823232Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:00.2823664Z def test_silu_mul_quant( 2025-05-07T20:32:00.2823904Z self, 2025-05-07T20:32:00.2824098Z T: int, 2025-05-07T20:32:00.2824290Z D: int, 2025-05-07T20:32:00.2824505Z scale_ub: Optional[float], 2025-05-07T20:32:00.2824774Z contiguous: bool, 2025-05-07T20:32:00.2825004Z compiled: bool, 2025-05-07T20:32:00.2825227Z ) -> None: 2025-05-07T20:32:00.2825440Z torch.manual_seed(2025) 2025-05-07T20:32:00.2825673Z 2025-05-07T20:32:00.2825940Z > x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:00.2827959Z E torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 56.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 30.44 MiB is free. Including non-PyTorch memory, this process has 22.03 GiB memory in use. Of the allocated memory 21.70 GiB is allocated by PyTorch, and 53.93 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. 
See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables) 2025-05-07T20:32:00.2829770Z 2025-05-07T20:32:00.2829887Z moe/activation_test.py:92: OutOfMemoryError 2025-05-07T20:32:00.2830102Z 2025-05-07T20:32:00.2830205Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:00.2830705Z self=, 2025-05-07T20:32:00.2831100Z T=1, 2025-05-07T20:32:00.2831289Z D=5120, 2025-05-07T20:32:00.2831486Z scale_ub=1200.0, 2025-05-07T20:32:00.2831700Z contiguous=True, 2025-05-07T20:32:00.2831920Z compiled=False, 2025-05-07T20:32:00.2832127Z ) 2025-05-07T20:32:00.2832438Z self = 2025-05-07T20:32:00.2832920Z T = 1, D = 5120, scale_ub = 1200.0, contiguous = True, compiled = False 2025-05-07T20:32:00.2833180Z 2025-05-07T20:32:00.2833264Z @given( 2025-05-07T20:32:00.2833494Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:00.2833886Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:00.2834191Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:00.2834520Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:00.2834842Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:00.2835133Z ) 2025-05-07T20:32:00.2835477Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:00.2835912Z def test_silu_mul_quant( 2025-05-07T20:32:00.2836154Z self, 2025-05-07T20:32:00.2836348Z T: int, 2025-05-07T20:32:00.2836537Z D: int, 2025-05-07T20:32:00.2836753Z scale_ub: Optional[float], 2025-05-07T20:32:00.2837020Z contiguous: bool, 2025-05-07T20:32:00.2837249Z compiled: bool, 2025-05-07T20:32:00.2837470Z ) -> None: 2025-05-07T20:32:00.2837682Z torch.manual_seed(2025) 2025-05-07T20:32:00.2837916Z 2025-05-07T20:32:00.2838184Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:00.2838527Z 2025-05-07T20:32:00.2838716Z x_sign = torch.sign(x) 2025-05-07T20:32:00.2839005Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:00.2839314Z x = x_sign * x_clamp 2025-05-07T20:32:00.2839560Z x0 = x[:, :D] 2025-05-07T20:32:00.2839768Z x1 = x[:, D:] 2025-05-07T20:32:00.2839977Z 2025-05-07T20:32:00.2840164Z if contiguous: 2025-05-07T20:32:00.2840386Z x0 = x0.contiguous() 2025-05-07T20:32:00.2840642Z x1 = x1.contiguous() 2025-05-07T20:32:00.2840882Z 2025-05-07T20:32:00.2841068Z if scale_ub is not None: 2025-05-07T20:32:00.2841338Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:00.2841669Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:00.2841980Z ) 2025-05-07T20:32:00.2842180Z else: 2025-05-07T20:32:00.2842385Z scale_ub_tensor = None 2025-05-07T20:32:00.2842636Z 2025-05-07T20:32:00.2842872Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:00.2843183Z op = silu_mul_quant 2025-05-07T20:32:00.2843435Z if compiled: 2025-05-07T20:32:00.2843687Z op = torch.compile(op) 2025-05-07T20:32:00.2843979Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:00.2844252Z 2025-05-07T20:32:00.2844444Z > y_fp8, y_scale = fn() 2025-05-07T20:32:00.2844604Z 2025-05-07T20:32:00.2844701Z moe/activation_test.py:117: 2025-05-07T20:32:00.2844997Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:00.2845326Z moe/activation_test.py:115: in fn 2025-05-07T20:32:00.2845598Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:00.2846280Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:32:00.2846964Z 
_fbgemm_silu_mul_quant[grid]( 2025-05-07T20:32:00.2847502Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:00.2848168Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:00.2848821Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:00.2849468Z kernel = self.compile( 2025-05-07T20:32:00.2850003Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:00.2850647Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:00.2851039Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:00.2851262Z 2025-05-07T20:32:00.2851471Z self = 2025-05-07T20:32:00.2852614Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:00.2854055Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f93973fb6a0>} 2025-05-07T20:32:00.2855375Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:00.2856379Z context = 2025-05-07T20:32:00.2856659Z 2025-05-07T20:32:00.2856829Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:00.2857333Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:00.2857794Z module_map=module_map) 2025-05-07T20:32:00.2858165Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:00.2858517Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:32:00.2858772Z E ^ 2025-05-07T20:32:00.2859230Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:00.2859676Z 2025-05-07T20:32:00.2860091Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:00.3687469Z 2025-05-07T20:32:00.3687952Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:00.3688883Z self=, 2025-05-07T20:32:00.3689678Z T=2048, 2025-05-07T20:32:00.3690055Z D=5120, 2025-05-07T20:32:00.3690433Z scale_ub=None, 2025-05-07T20:32:00.3690855Z contiguous=True, 2025-05-07T20:32:00.3691289Z compiled=False, 2025-05-07T20:32:00.3691689Z ) 2025-05-07T20:32:00.3692338Z self = 2025-05-07T20:32:00.3693306Z T = 2048, D = 5120, scale_ub = None, contiguous = True, compiled = False 2025-05-07T20:32:00.3694010Z 2025-05-07T20:32:00.3694168Z @given( 2025-05-07T20:32:00.3694677Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:00.3695113Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:00.3695427Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:00.3695757Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:00.3696076Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:00.3696369Z ) 2025-05-07T20:32:00.3696717Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:00.3697161Z def test_silu_mul_quant( 2025-05-07T20:32:00.3697398Z self, 2025-05-07T20:32:00.3697605Z T: int, 2025-05-07T20:32:00.3697807Z D: int, 2025-05-07T20:32:00.3698021Z scale_ub: Optional[float], 2025-05-07T20:32:00.3698646Z contiguous: bool, 2025-05-07T20:32:00.3698895Z compiled: bool, 2025-05-07T20:32:00.3699113Z ) -> None: 2025-05-07T20:32:00.3699329Z torch.manual_seed(2025) 2025-05-07T20:32:00.3699570Z 2025-05-07T20:32:00.3700132Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:00.3700472Z 2025-05-07T20:32:00.3700663Z > x_sign = torch.sign(x) 2025-05-07T20:32:00.3702570Z E torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 40.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 30.44 MiB is free. Including non-PyTorch memory, this process has 22.03 GiB memory in use. Of the allocated memory 21.73 GiB is allocated by PyTorch, and 13.87 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. 
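For context on what is being exercised: the log only reveals that silu_mul_quant takes the two halves of a gated activation plus an optional scale upper bound and returns a (y_fp8, y_scale) pair. A rough pure-PyTorch sketch of that contract follows; the rowwise scaling and clamping details are assumptions for illustration, not taken from the fused Triton kernel:

    import torch
    import torch.nn.functional as F

    FP8_MAX = torch.finfo(torch.float8_e4m3fn).max  # 448.0 for E4M3

    def silu_mul_quant_ref(
        x0: torch.Tensor,
        x1: torch.Tensor,
        scale_ub: torch.Tensor | None = None,
    ) -> tuple[torch.Tensor, torch.Tensor]:
        # SiLU-gate then multiply, in fp32 for accuracy.
        y = F.silu(x0.float()) * x1.float()
        # Hypothetical per-row scales, optionally capped by scale_ub.
        row_max = y.abs().amax(dim=-1, keepdim=True).clamp(min=1e-12)
        if scale_ub is not None:
            row_max = torch.minimum(row_max, scale_ub)
        y_scale = row_max / FP8_MAX
        y_fp8 = (y / y_scale).clamp(-FP8_MAX, FP8_MAX).to(torch.float8_e4m3fn)
        return y_fp8, y_scale.squeeze(-1)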
See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables) 2025-05-07T20:32:00.3704521Z 2025-05-07T20:32:00.3704648Z moe/activation_test.py:94: OutOfMemoryError 2025-05-07T20:32:00.3704856Z 2025-05-07T20:32:00.3704958Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:00.3705363Z self=, 2025-05-07T20:32:00.3705766Z T=16384, 2025-05-07T20:32:00.3705954Z D=5120, 2025-05-07T20:32:00.3706146Z scale_ub=None, 2025-05-07T20:32:00.3706358Z contiguous=True, 2025-05-07T20:32:00.3706572Z compiled=False, 2025-05-07T20:32:00.3706777Z ) 2025-05-07T20:32:00.3707093Z self = 2025-05-07T20:32:00.3707573Z T = 16384, D = 5120, scale_ub = None, contiguous = True, compiled = False 2025-05-07T20:32:00.3707848Z 2025-05-07T20:32:00.3707926Z @given( 2025-05-07T20:32:00.3708158Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:00.3708471Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:00.3708778Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:00.3709104Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:00.3709428Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:00.3709717Z ) 2025-05-07T20:32:00.3710064Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:00.3710506Z def test_silu_mul_quant( 2025-05-07T20:32:00.3710737Z self, 2025-05-07T20:32:00.3710931Z T: int, 2025-05-07T20:32:00.3711131Z D: int, 2025-05-07T20:32:00.3711346Z scale_ub: Optional[float], 2025-05-07T20:32:00.3711617Z contiguous: bool, 2025-05-07T20:32:00.3711860Z compiled: bool, 2025-05-07T20:32:00.3712080Z ) -> None: 2025-05-07T20:32:00.3712298Z torch.manual_seed(2025) 2025-05-07T20:32:00.3712541Z 2025-05-07T20:32:00.3712812Z > x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:00.3714802Z E torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 320.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 30.44 MiB is free. Including non-PyTorch memory, this process has 22.03 GiB memory in use. Of the allocated memory 21.73 GiB is allocated by PyTorch, and 13.87 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. 
See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables) 2025-05-07T20:32:00.3716663Z 2025-05-07T20:32:00.3716778Z moe/activation_test.py:92: OutOfMemoryError 2025-05-07T20:32:00.3716993Z 2025-05-07T20:32:00.3717095Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:00.3717498Z self=, 2025-05-07T20:32:00.3717895Z T=4096, 2025-05-07T20:32:00.3718078Z D=5120, 2025-05-07T20:32:00.3718270Z scale_ub=None, 2025-05-07T20:32:00.3718489Z contiguous=True, 2025-05-07T20:32:00.3718706Z compiled=False, 2025-05-07T20:32:00.3718908Z ) 2025-05-07T20:32:00.3719223Z self = 2025-05-07T20:32:00.3719702Z T = 4096, D = 5120, scale_ub = None, contiguous = True, compiled = False 2025-05-07T20:32:00.3720063Z 2025-05-07T20:32:00.3720140Z @given( 2025-05-07T20:32:00.3720368Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:00.3720670Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:00.3720973Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:00.3721296Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:00.3721618Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:00.3721897Z ) 2025-05-07T20:32:00.3722239Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:00.3722677Z def test_silu_mul_quant( 2025-05-07T20:32:00.3722906Z self, 2025-05-07T20:32:00.3723179Z T: int, 2025-05-07T20:32:00.3723374Z D: int, 2025-05-07T20:32:00.3723582Z scale_ub: Optional[float], 2025-05-07T20:32:00.3723854Z contiguous: bool, 2025-05-07T20:32:00.3724089Z compiled: bool, 2025-05-07T20:32:00.3724309Z ) -> None: 2025-05-07T20:32:00.3724521Z torch.manual_seed(2025) 2025-05-07T20:32:00.3724762Z 2025-05-07T20:32:00.3725039Z > x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:00.3727053Z E torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 80.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 30.44 MiB is free. Including non-PyTorch memory, this process has 22.03 GiB memory in use. Of the allocated memory 21.73 GiB is allocated by PyTorch, and 13.87 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. 
See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables) 2025-05-07T20:32:00.3728861Z 2025-05-07T20:32:00.3728977Z moe/activation_test.py:92: OutOfMemoryError 2025-05-07T20:32:00.3729192Z 2025-05-07T20:32:00.3729293Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:00.3729703Z self=, 2025-05-07T20:32:00.3730093Z T=2048, 2025-05-07T20:32:00.3730280Z D=5120, 2025-05-07T20:32:00.3730468Z scale_ub=None, 2025-05-07T20:32:00.3730675Z contiguous=False, 2025-05-07T20:32:00.3730902Z compiled=False, 2025-05-07T20:32:00.3731108Z ) 2025-05-07T20:32:00.3731417Z self = 2025-05-07T20:32:00.3731925Z T = 2048, D = 5120, scale_ub = None, contiguous = False, compiled = False 2025-05-07T20:32:00.3732193Z 2025-05-07T20:32:00.3732273Z @given( 2025-05-07T20:32:00.3732503Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:00.3732819Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:00.3733117Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:00.3733443Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:00.3733865Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:00.3734149Z ) 2025-05-07T20:32:00.3734485Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:00.3734924Z def test_silu_mul_quant( 2025-05-07T20:32:00.3745002Z self, 2025-05-07T20:32:00.3745244Z T: int, 2025-05-07T20:32:00.3745452Z D: int, 2025-05-07T20:32:00.3745669Z scale_ub: Optional[float], 2025-05-07T20:32:00.3745947Z contiguous: bool, 2025-05-07T20:32:00.3746192Z compiled: bool, 2025-05-07T20:32:00.3746415Z ) -> None: 2025-05-07T20:32:00.3746642Z torch.manual_seed(2025) 2025-05-07T20:32:00.3746889Z 2025-05-07T20:32:00.3747169Z > x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:00.3749185Z E torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 40.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 30.44 MiB is free. Including non-PyTorch memory, this process has 22.03 GiB memory in use. Of the allocated memory 21.73 GiB is allocated by PyTorch, and 13.87 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. 
See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables) 2025-05-07T20:32:00.3751124Z 2025-05-07T20:32:00.3751245Z moe/activation_test.py:92: OutOfMemoryError 2025-05-07T20:32:00.3751457Z 2025-05-07T20:32:00.3751570Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:00.3751982Z self=, 2025-05-07T20:32:00.3752377Z T=4096, 2025-05-07T20:32:00.3752575Z D=7168, 2025-05-07T20:32:00.3752859Z scale_ub=None, 2025-05-07T20:32:00.3753074Z contiguous=True, 2025-05-07T20:32:00.3753306Z compiled=True, 2025-05-07T20:32:00.3753524Z ) 2025-05-07T20:32:00.3753840Z self = 2025-05-07T20:32:00.3754346Z T = 4096, D = 7168, scale_ub = None, contiguous = True, compiled = True 2025-05-07T20:32:00.3754611Z 2025-05-07T20:32:00.3754704Z @given( 2025-05-07T20:32:00.3754933Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:00.3755251Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:00.3755558Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:00.3755894Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:00.3756216Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:00.3756511Z ) 2025-05-07T20:32:00.3756867Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:00.3757310Z def test_silu_mul_quant( 2025-05-07T20:32:00.3757557Z self, 2025-05-07T20:32:00.3757756Z T: int, 2025-05-07T20:32:00.3757951Z D: int, 2025-05-07T20:32:00.3758176Z scale_ub: Optional[float], 2025-05-07T20:32:00.3758458Z contiguous: bool, 2025-05-07T20:32:00.3758694Z compiled: bool, 2025-05-07T20:32:00.3758919Z ) -> None: 2025-05-07T20:32:00.3759138Z torch.manual_seed(2025) 2025-05-07T20:32:00.3759374Z 2025-05-07T20:32:00.3759647Z > x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:00.3761648Z E torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 112.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 30.44 MiB is free. Including non-PyTorch memory, this process has 22.03 GiB memory in use. Of the allocated memory 21.73 GiB is allocated by PyTorch, and 13.87 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. 
See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables) 2025-05-07T20:32:00.3763460Z 2025-05-07T20:32:00.3763577Z moe/activation_test.py:92: OutOfMemoryError 2025-05-07T20:32:00.3763792Z 2025-05-07T20:32:00.3763900Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:00.3764298Z self=, 2025-05-07T20:32:00.3764701Z T=2048, 2025-05-07T20:32:00.3764885Z D=5120, 2025-05-07T20:32:00.3765104Z scale_ub=1200.0, 2025-05-07T20:32:00.3765351Z contiguous=False, 2025-05-07T20:32:00.3765571Z compiled=False, 2025-05-07T20:32:00.4310569Z ) 2025-05-07T20:32:00.4311071Z self = 2025-05-07T20:32:00.4311761Z T = 2048, D = 5120, scale_ub = 1200.0, contiguous = False, compiled = False 2025-05-07T20:32:00.4312120Z 2025-05-07T20:32:00.4312208Z @given( 2025-05-07T20:32:00.4312453Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:00.4312773Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:00.4313090Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:00.4313687Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:00.4314052Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:00.4314444Z ) 2025-05-07T20:32:00.4314837Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:00.4315278Z def test_silu_mul_quant( 2025-05-07T20:32:00.4315529Z self, 2025-05-07T20:32:00.4315734Z T: int, 2025-05-07T20:32:00.4315931Z D: int, 2025-05-07T20:32:00.4316156Z scale_ub: Optional[float], 2025-05-07T20:32:00.4316435Z contiguous: bool, 2025-05-07T20:32:00.4316676Z compiled: bool, 2025-05-07T20:32:00.4316918Z ) -> None: 2025-05-07T20:32:00.4317142Z torch.manual_seed(2025) 2025-05-07T20:32:00.4317386Z 2025-05-07T20:32:00.4317809Z > x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:00.4319828Z E torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 40.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 30.44 MiB is free. Including non-PyTorch memory, this process has 22.03 GiB memory in use. Of the allocated memory 21.73 GiB is allocated by PyTorch, and 13.87 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. 
See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables) 2025-05-07T20:32:00.4321654Z 2025-05-07T20:32:00.4321777Z moe/activation_test.py:92: OutOfMemoryError 2025-05-07T20:32:00.4321989Z 2025-05-07T20:32:00.4322104Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:00.4322514Z self=, 2025-05-07T20:32:00.4322923Z T=4096, 2025-05-07T20:32:00.4323120Z D=7168, 2025-05-07T20:32:00.4323322Z scale_ub=1200.0, 2025-05-07T20:32:00.4323552Z contiguous=True, 2025-05-07T20:32:00.4323783Z compiled=False, 2025-05-07T20:32:00.4323996Z ) 2025-05-07T20:32:00.4324320Z self = 2025-05-07T20:32:00.4324814Z T = 4096, D = 7168, scale_ub = 1200.0, contiguous = True, compiled = False 2025-05-07T20:32:00.4325090Z 2025-05-07T20:32:00.4325180Z @given( 2025-05-07T20:32:00.4325409Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:00.4325727Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:00.4326034Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:00.4326361Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:00.4326693Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:00.4326984Z ) 2025-05-07T20:32:00.4327332Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:00.4327778Z def test_silu_mul_quant( 2025-05-07T20:32:00.4328023Z self, 2025-05-07T20:32:00.4328224Z T: int, 2025-05-07T20:32:00.4328422Z D: int, 2025-05-07T20:32:00.4328653Z scale_ub: Optional[float], 2025-05-07T20:32:00.4328929Z contiguous: bool, 2025-05-07T20:32:00.4329166Z compiled: bool, 2025-05-07T20:32:00.4329396Z ) -> None: 2025-05-07T20:32:00.4329617Z torch.manual_seed(2025) 2025-05-07T20:32:00.4329858Z 2025-05-07T20:32:00.4330130Z > x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:00.4332135Z E torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 112.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 30.44 MiB is free. Including non-PyTorch memory, this process has 22.03 GiB memory in use. Of the allocated memory 21.73 GiB is allocated by PyTorch, and 13.87 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. 
See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables) 2025-05-07T20:32:00.4334610Z 2025-05-07T20:32:00.4334737Z moe/activation_test.py:92: OutOfMemoryError 2025-05-07T20:32:00.4334948Z 2025-05-07T20:32:00.4335059Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:00.4335463Z self=, 2025-05-07T20:32:00.4335865Z T=16384, 2025-05-07T20:32:00.4336067Z D=7168, 2025-05-07T20:32:00.4336264Z scale_ub=None, 2025-05-07T20:32:00.4336487Z contiguous=False, 2025-05-07T20:32:00.4336717Z compiled=True, 2025-05-07T20:32:00.4336920Z ) 2025-05-07T20:32:00.4337242Z self = 2025-05-07T20:32:00.4337737Z T = 16384, D = 7168, scale_ub = None, contiguous = False, compiled = True 2025-05-07T20:32:00.4338093Z 2025-05-07T20:32:00.4338175Z @given( 2025-05-07T20:32:00.4338413Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:00.4338733Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:00.4339051Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:00.4339378Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:00.4339711Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:00.4340004Z ) 2025-05-07T20:32:00.4340350Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:00.4340798Z def test_silu_mul_quant( 2025-05-07T20:32:00.4341045Z self, 2025-05-07T20:32:00.4341242Z T: int, 2025-05-07T20:32:00.4341445Z D: int, 2025-05-07T20:32:00.4341675Z scale_ub: Optional[float], 2025-05-07T20:32:00.4341951Z contiguous: bool, 2025-05-07T20:32:00.4342197Z compiled: bool, 2025-05-07T20:32:00.4342427Z ) -> None: 2025-05-07T20:32:00.4342647Z torch.manual_seed(2025) 2025-05-07T20:32:00.4342891Z 2025-05-07T20:32:00.4343164Z > x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:00.4345181Z E torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 448.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 30.44 MiB is free. Including non-PyTorch memory, this process has 22.03 GiB memory in use. Of the allocated memory 21.73 GiB is allocated by PyTorch, and 13.87 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. 
See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables) 2025-05-07T20:32:00.4384063Z 2025-05-07T20:32:00.4384179Z moe/activation_test.py:92: OutOfMemoryError 2025-05-07T20:32:00.6187457Z 2025-05-07T20:32:00.6188161Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:00.6188805Z self=, 2025-05-07T20:32:00.6189342Z T=128, 2025-05-07T20:32:00.6189591Z D=5120, 2025-05-07T20:32:00.6189843Z scale_ub=1200.0, 2025-05-07T20:32:00.6190132Z contiguous=False, 2025-05-07T20:32:00.6190359Z compiled=False, 2025-05-07T20:32:00.6190571Z ) 2025-05-07T20:32:00.6190909Z self = 2025-05-07T20:32:00.6191408Z T = 128, D = 5120, scale_ub = 1200.0, contiguous = False, compiled = False 2025-05-07T20:32:00.6191683Z 2025-05-07T20:32:00.6191762Z @given( 2025-05-07T20:32:00.6192005Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:00.6192311Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:00.6192621Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:00.6192950Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:00.6193268Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:00.6193554Z ) 2025-05-07T20:32:00.6193904Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:00.6194337Z def test_silu_mul_quant( 2025-05-07T20:32:00.6194577Z self, 2025-05-07T20:32:00.6194770Z T: int, 2025-05-07T20:32:00.6194970Z D: int, 2025-05-07T20:32:00.6195181Z scale_ub: Optional[float], 2025-05-07T20:32:00.6195462Z contiguous: bool, 2025-05-07T20:32:00.6195702Z compiled: bool, 2025-05-07T20:32:00.6195925Z ) -> None: 2025-05-07T20:32:00.6196142Z torch.manual_seed(2025) 2025-05-07T20:32:00.6196385Z 2025-05-07T20:32:00.6197048Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:00.6197389Z 2025-05-07T20:32:00.6197589Z x_sign = torch.sign(x) 2025-05-07T20:32:00.6197870Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:00.6198436Z x = x_sign * x_clamp 2025-05-07T20:32:00.6198709Z x0 = x[:, :D] 2025-05-07T20:32:00.6198924Z x1 = x[:, D:] 2025-05-07T20:32:00.6199127Z 2025-05-07T20:32:00.6199305Z if contiguous: 2025-05-07T20:32:00.6199538Z x0 = x0.contiguous() 2025-05-07T20:32:00.6199796Z x1 = x1.contiguous() 2025-05-07T20:32:00.6200029Z 2025-05-07T20:32:00.6200221Z if scale_ub is not None: 2025-05-07T20:32:00.6200655Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:00.6200985Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:00.6201293Z ) 2025-05-07T20:32:00.6201487Z else: 2025-05-07T20:32:00.6201705Z scale_ub_tensor = None 2025-05-07T20:32:00.6201955Z 2025-05-07T20:32:00.6202185Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:00.6202494Z op = silu_mul_quant 2025-05-07T20:32:00.6202742Z if compiled: 2025-05-07T20:32:00.6202984Z op = torch.compile(op) 2025-05-07T20:32:00.6203277Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:00.6203544Z 2025-05-07T20:32:00.6203736Z > y_fp8, y_scale = fn() 2025-05-07T20:32:00.6203900Z 2025-05-07T20:32:00.6204002Z moe/activation_test.py:117: 2025-05-07T20:32:00.6204286Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:00.6204619Z moe/activation_test.py:115: in fn 2025-05-07T20:32:00.6204910Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:00.6205585Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:32:00.6206275Z 
_fbgemm_silu_mul_quant[grid]( 2025-05-07T20:32:00.6206808Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:00.6207477Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:00.6208125Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:00.6208652Z kernel = self.compile( 2025-05-07T20:32:00.6209189Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:00.6209836Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:00.6210229Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:00.6210458Z 2025-05-07T20:32:00.6210661Z self = 2025-05-07T20:32:00.6211719Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:00.6213080Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f9397275bc0>} 2025-05-07T20:32:00.6214574Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:00.6215581Z context = 2025-05-07T20:32:00.6215874Z 2025-05-07T20:32:00.6216037Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:00.6216551Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:00.6217146Z module_map=module_map) 2025-05-07T20:32:00.6217508Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:00.6217856Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:32:00.6218110Z E ^ 2025-05-07T20:32:00.6218565Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:00.6219007Z 2025-05-07T20:32:00.6219412Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:00.6219913Z 2025-05-07T20:32:00.6220021Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:00.6220501Z self=, 2025-05-07T20:32:00.6220902Z T=2048, 2025-05-07T20:32:00.6221092Z D=7168, 2025-05-07T20:32:00.6221278Z scale_ub=None, 2025-05-07T20:32:00.6221491Z contiguous=False, 2025-05-07T20:32:00.6221718Z compiled=False, 2025-05-07T20:32:00.6221922Z ) 2025-05-07T20:32:00.6222243Z self = 2025-05-07T20:32:00.6222730Z T = 2048, D = 7168, scale_ub = None, contiguous = False, compiled = False 2025-05-07T20:32:00.6222996Z 2025-05-07T20:32:00.6223079Z @given( 2025-05-07T20:32:00.6223300Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:00.6223610Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:00.6223916Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:00.6224237Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:00.6224560Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:00.6224839Z ) 2025-05-07T20:32:00.6225183Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:00.6225619Z def test_silu_mul_quant( 2025-05-07T20:32:00.6225855Z self, 2025-05-07T20:32:00.6226046Z T: int, 2025-05-07T20:32:00.6226241Z D: int, 2025-05-07T20:32:00.6226455Z scale_ub: Optional[float], 2025-05-07T20:32:00.6226723Z contiguous: bool, 2025-05-07T20:32:00.6226959Z compiled: bool, 2025-05-07T20:32:00.6227180Z ) -> None: 2025-05-07T20:32:00.6227392Z torch.manual_seed(2025) 2025-05-07T20:32:00.6227624Z 2025-05-07T20:32:00.6227888Z > x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:00.6229895Z E torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 56.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 30.44 MiB is free. Including non-PyTorch memory, this process has 22.03 GiB memory in use. Of the allocated memory 21.74 GiB is allocated by PyTorch, and 5.24 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. 
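As the error message itself suggests, the caching allocator can be switched to expandable segments; a minimal sketch of that workaround (the variable must be in the environment before PyTorch creates its CUDA context):

    import os
    # Must be set before the first CUDA allocation in the process.
    os.environ["PYTORCH_CUDA_ALLOC_CONF"] = "expandable_segments:True"

    import torch
    x = torch.randn([16384, 2 * 7168], device="cuda", dtype=torch.bfloat16)

Note, though, that this run reports 21.73 GiB allocated by PyTorch with only ~14 MiB reserved but unallocated, so the device looks genuinely exhausted rather than fragmented; releasing tensors between Hypothesis examples (dropping references, then torch.cuda.empty_cache()) is likely the more relevant fix here.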
See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables) 2025-05-07T20:32:00.6231704Z 2025-05-07T20:32:00.6231826Z moe/activation_test.py:92: OutOfMemoryError 2025-05-07T20:32:00.6232034Z 2025-05-07T20:32:00.6232138Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:00.6232534Z self=, 2025-05-07T20:32:00.6232932Z T=128, 2025-05-07T20:32:00.6233120Z D=7168, 2025-05-07T20:32:00.6233309Z scale_ub=1200.0, 2025-05-07T20:32:00.6233529Z contiguous=True, 2025-05-07T20:32:00.6233746Z compiled=True, 2025-05-07T20:32:00.6233944Z ) 2025-05-07T20:32:00.6234259Z self = 2025-05-07T20:32:00.6234758Z T = 128, D = 7168, scale_ub = 1200.0, contiguous = True, compiled = True 2025-05-07T20:32:00.6235023Z 2025-05-07T20:32:00.6235101Z @given( 2025-05-07T20:32:00.6235330Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:00.6235631Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:00.6236052Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:00.6236385Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:00.6236703Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:00.6236993Z ) 2025-05-07T20:32:00.6237337Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:00.6237774Z def test_silu_mul_quant( 2025-05-07T20:32:00.6238005Z self, 2025-05-07T20:32:00.6238205Z T: int, 2025-05-07T20:32:00.6238407Z D: int, 2025-05-07T20:32:00.6248758Z scale_ub: Optional[float], 2025-05-07T20:32:00.6249169Z contiguous: bool, 2025-05-07T20:32:00.6249419Z compiled: bool, 2025-05-07T20:32:00.6249769Z ) -> None: 2025-05-07T20:32:00.6249988Z torch.manual_seed(2025) 2025-05-07T20:32:00.6250228Z 2025-05-07T20:32:00.6250493Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:00.6250844Z 2025-05-07T20:32:00.6251039Z x_sign = torch.sign(x) 2025-05-07T20:32:00.6251322Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:00.6251627Z x = x_sign * x_clamp 2025-05-07T20:32:00.6251864Z x0 = x[:, :D] 2025-05-07T20:32:00.6252067Z x1 = x[:, D:] 2025-05-07T20:32:00.6252280Z 2025-05-07T20:32:00.6252461Z if contiguous: 2025-05-07T20:32:00.6252690Z x0 = x0.contiguous() 2025-05-07T20:32:00.6252935Z x1 = x1.contiguous() 2025-05-07T20:32:00.6253174Z 2025-05-07T20:32:00.6253371Z if scale_ub is not None: 2025-05-07T20:32:00.6253701Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:00.6254051Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:00.6254367Z ) 2025-05-07T20:32:00.6254557Z else: 2025-05-07T20:32:00.6254769Z scale_ub_tensor = None 2025-05-07T20:32:00.6255023Z 2025-05-07T20:32:00.6255243Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:00.6255573Z op = silu_mul_quant 2025-05-07T20:32:00.6255822Z if compiled: 2025-05-07T20:32:00.6256064Z op = torch.compile(op) 2025-05-07T20:32:00.6256359Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:00.6256636Z 2025-05-07T20:32:00.6256820Z > y_fp8, y_scale = fn() 2025-05-07T20:32:00.6256988Z 2025-05-07T20:32:00.6257084Z moe/activation_test.py:117: 2025-05-07T20:32:00.6257376Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:00.6257711Z moe/activation_test.py:115: in fn 2025-05-07T20:32:00.6257987Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:00.6258550Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/torch/_dynamo/eval_frame.py:678: in _fn 2025-05-07T20:32:00.6259107Z return fn(*args, **kwargs) 
2025-05-07T20:32:00.6259750Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:32:00.6260428Z _fbgemm_silu_mul_quant[grid]( 2025-05-07T20:32:00.6260957Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:00.6261625Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:00.6262270Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:00.6262795Z kernel = self.compile( 2025-05-07T20:32:00.6263326Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:00.6263969Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:00.6264358Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:00.6264581Z 2025-05-07T20:32:00.6264781Z self = 2025-05-07T20:32:00.6265931Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:00.6267271Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f93971742c0>} 2025-05-07T20:32:00.6268587Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:00.6269648Z context = 2025-05-07T20:32:00.6269923Z 2025-05-07T20:32:00.6270085Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:00.6270591Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:00.6271049Z module_map=module_map) 2025-05-07T20:32:00.6271409Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:00.6271746Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:32:00.6272004Z E ^ 2025-05-07T20:32:00.6272455Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:00.6272889Z 2025-05-07T20:32:00.6273295Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:01.2247238Z 2025-05-07T20:32:01.2247694Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:01.2248584Z self=, 2025-05-07T20:32:01.2249448Z T=128, 2025-05-07T20:32:01.2249839Z D=7168, 2025-05-07T20:32:01.2250236Z scale_ub=1200.0, 2025-05-07T20:32:01.2250697Z contiguous=True, 2025-05-07T20:32:01.2251145Z compiled=False, 2025-05-07T20:32:01.2251571Z ) 2025-05-07T20:32:01.2252209Z self = 2025-05-07T20:32:01.2253195Z T = 128, D = 7168, scale_ub = 1200.0, contiguous = True, compiled = False 2025-05-07T20:32:01.2253906Z 2025-05-07T20:32:01.2254082Z @given( 2025-05-07T20:32:01.2254549Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:01.2255163Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:01.2255514Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:01.2255862Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:01.2256187Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:01.2256486Z ) 2025-05-07T20:32:01.2256840Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:01.2257281Z def test_silu_mul_quant( 2025-05-07T20:32:01.2257540Z self, 2025-05-07T20:32:01.2257743Z T: int, 2025-05-07T20:32:01.2257941Z D: int, 2025-05-07T20:32:01.2258166Z scale_ub: Optional[float], 2025-05-07T20:32:01.2258442Z contiguous: bool, 2025-05-07T20:32:01.2258682Z compiled: bool, 2025-05-07T20:32:01.2258917Z ) -> None: 2025-05-07T20:32:01.2259142Z torch.manual_seed(2025) 2025-05-07T20:32:01.2259382Z 2025-05-07T20:32:01.2259663Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:01.2260015Z 2025-05-07T20:32:01.2260211Z x_sign = torch.sign(x) 2025-05-07T20:32:01.2260506Z > x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:01.2262482Z E torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 20.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 8.44 MiB is free. Including non-PyTorch memory, this process has 22.05 GiB memory in use. Of the allocated memory 21.77 GiB is allocated by PyTorch, and 4.62 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. 
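The CompilationError blocks above ("type fp8e4nv not supported in this architecture") are Triton rejecting the e4m3 fp8 dtype on this runner's GPU: a g5.4xlarge carries an A10G (sm_86), while Triton's fp8e4nv path needs a newer compute capability, which is why only 'fp8e4b15' and 'fp8e5' are offered. A hedged sketch of a capability guard such a test could use (the helper name and the 8.9 threshold are assumptions, not FBGEMM API):

    import unittest
    import torch

    def _supports_fp8e4nv() -> bool:
        # fp8e4nv (float8 e4m3) is assumed to require sm_89+ (Ada/Hopper);
        # the A10G on this runner reports (8, 6), hence the Triton error above.
        major, minor = torch.cuda.get_device_capability()
        return (major, minor) >= (8, 9)

    # Applied on top of the existing @given/@settings decorators, e.g.:
    # @unittest.skipUnless(_supports_fp8e4nv(), "fp8e4nv requires sm_89+")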
See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables) 2025-05-07T20:32:01.2264510Z 2025-05-07T20:32:01.2264633Z moe/activation_test.py:95: OutOfMemoryError 2025-05-07T20:32:01.2264850Z 2025-05-07T20:32:01.2264958Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:01.2265365Z self=, 2025-05-07T20:32:01.2265761Z T=128, 2025-05-07T20:32:01.2265955Z D=5120, 2025-05-07T20:32:01.2266156Z scale_ub=1200.0, 2025-05-07T20:32:01.2266380Z contiguous=True, 2025-05-07T20:32:01.2266739Z compiled=True, 2025-05-07T20:32:01.2266954Z ) 2025-05-07T20:32:01.2267269Z self = 2025-05-07T20:32:01.2267749Z T = 128, D = 5120, scale_ub = 1200.0, contiguous = True, compiled = True 2025-05-07T20:32:01.2268023Z 2025-05-07T20:32:01.2268103Z @given( 2025-05-07T20:32:01.2268337Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:01.2268646Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:01.2268954Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:01.2269285Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:01.2269604Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:01.2269890Z ) 2025-05-07T20:32:01.2270237Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:01.2270673Z def test_silu_mul_quant( 2025-05-07T20:32:01.2270917Z self, 2025-05-07T20:32:01.2271113Z T: int, 2025-05-07T20:32:01.2271313Z D: int, 2025-05-07T20:32:01.2271534Z scale_ub: Optional[float], 2025-05-07T20:32:01.2271805Z contiguous: bool, 2025-05-07T20:32:01.2272050Z compiled: bool, 2025-05-07T20:32:01.2272276Z ) -> None: 2025-05-07T20:32:01.2272499Z torch.manual_seed(2025) 2025-05-07T20:32:01.2272741Z 2025-05-07T20:32:01.2273008Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:01.2273345Z 2025-05-07T20:32:01.2273543Z > x_sign = torch.sign(x) 2025-05-07T20:32:01.2275431Z E torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 20.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 8.44 MiB is free. Including non-PyTorch memory, this process has 22.05 GiB memory in use. Of the allocated memory 21.77 GiB is allocated by PyTorch, and 2.12 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. 
See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables) 2025-05-07T20:32:01.2277243Z 2025-05-07T20:32:01.2277364Z moe/activation_test.py:94: OutOfMemoryError 2025-05-07T20:32:01.2277587Z 2025-05-07T20:32:01.2277689Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:01.2278100Z self=, 2025-05-07T20:32:01.2278501Z T=128, 2025-05-07T20:32:01.2278688Z D=7168, 2025-05-07T20:32:01.2278881Z scale_ub=None, 2025-05-07T20:32:01.2279098Z contiguous=True, 2025-05-07T20:32:01.2279319Z compiled=True, 2025-05-07T20:32:01.2279521Z ) 2025-05-07T20:32:01.2279837Z self = 2025-05-07T20:32:01.2280312Z T = 128, D = 7168, scale_ub = None, contiguous = True, compiled = True 2025-05-07T20:32:01.2280582Z 2025-05-07T20:32:01.2280662Z @given( 2025-05-07T20:32:01.2280901Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:01.2281213Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:01.2281522Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:01.2281848Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:01.2282263Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:01.2282539Z ) 2025-05-07T20:32:01.2282882Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:01.2283319Z def test_silu_mul_quant( 2025-05-07T20:32:01.2283552Z self, 2025-05-07T20:32:01.2283749Z T: int, 2025-05-07T20:32:01.2283947Z D: int, 2025-05-07T20:32:01.2284165Z scale_ub: Optional[float], 2025-05-07T20:32:01.2284436Z contiguous: bool, 2025-05-07T20:32:01.2284678Z compiled: bool, 2025-05-07T20:32:01.2284894Z ) -> None: 2025-05-07T20:32:01.2285110Z torch.manual_seed(2025) 2025-05-07T20:32:01.2285355Z 2025-05-07T20:32:01.2285699Z > x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:01.2287688Z E torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 20.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 8.44 MiB is free. Including non-PyTorch memory, this process has 22.05 GiB memory in use. Of the allocated memory 21.77 GiB is allocated by PyTorch, and 2.12 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. 
See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables) 2025-05-07T20:32:01.2289498Z 2025-05-07T20:32:01.2289617Z moe/activation_test.py:92: OutOfMemoryError 2025-05-07T20:32:01.2289830Z 2025-05-07T20:32:01.2337692Z FAILED 2025-05-07T20:32:01.2338040Z 2025-05-07T20:32:01.2338505Z =================================== FAILURES =================================== 2025-05-07T20:32:01.2339118Z _____________________ ActivationTests.test_silu_mul_quant ______________________ 2025-05-07T20:32:01.2339734Z + Exception Group Traceback (most recent call last): 2025-05-07T20:32:01.2340574Z | File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.13/unittest/case.py", line 58, in testPartExecutor 2025-05-07T20:32:01.2341321Z | yield 2025-05-07T20:32:01.2341903Z | File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.13/unittest/case.py", line 651, in run 2025-05-07T20:32:01.2342615Z | self._callTestMethod(testMethod) 2025-05-07T20:32:01.2343012Z | ~~~~~~~~~~~~~~~~~~~~^^^^^^^^^^^^ 2025-05-07T20:32:01.2343748Z | File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.13/unittest/case.py", line 606, in _callTestMethod 2025-05-07T20:32:01.2344487Z | if method() is not None: 2025-05-07T20:32:01.2344824Z | ~~~~~~^^ 2025-05-07T20:32:01.2345695Z | File "/home/ec2-user/actions-runner/_work/FBGEMM/FBGEMM/fbgemm_gpu/experimental/gen_ai/test/moe/activation_test.py", line 75, in test_silu_mul_quant 2025-05-07T20:32:01.2346716Z | T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:01.2347114Z | ^^^^^^^ 2025-05-07T20:32:01.2347886Z | File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/hypothesis/core.py", line 1850, in wrapped_test 2025-05-07T20:32:01.2348756Z | raise the_error_hypothesis_found 2025-05-07T20:32:01.2349320Z | ExceptionGroup: Hypothesis found 4 distinct failures. (4 sub-exceptions) 2025-05-07T20:32:01.2349894Z +-+---------------- 1 ---------------- 2025-05-07T20:32:01.2350301Z | Traceback (most recent call last): 2025-05-07T20:32:01.2351281Z | File "/home/ec2-user/actions-runner/_work/FBGEMM/FBGEMM/fbgemm_gpu/experimental/gen_ai/test/moe/activation_test.py", line 92, in test_silu_mul_quant 2025-05-07T20:32:01.2352337Z | x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:01.2355182Z | torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 20.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 8.44 MiB is free. Including non-PyTorch memory, this process has 22.05 GiB memory in use. Of the allocated memory 21.77 GiB is allocated by PyTorch, and 2.12 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. 
See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables) 2025-05-07T20:32:01.2357876Z | Falsifying example: test_silu_mul_quant( 2025-05-07T20:32:01.2358307Z | self=, 2025-05-07T20:32:01.2358709Z | T=128, 2025-05-07T20:32:01.2358901Z | D=7168, 2025-05-07T20:32:01.2359110Z | scale_ub=1200.0, 2025-05-07T20:32:01.2359668Z | contiguous=True, 2025-05-07T20:32:01.2359903Z | compiled=False, 2025-05-07T20:32:01.2360132Z | ) 2025-05-07T20:32:01.2360464Z | 2025-05-07T20:32:01.2360987Z | You can reproduce this example by temporarily adding @reproduce_failure('6.131.14', b'AEEBQQFBAUEAQQE=') as a decorator on your test case 2025-05-07T20:32:01.2361581Z +---------------- 2 ---------------- 2025-05-07T20:32:01.2361874Z | Traceback (most recent call last): 2025-05-07T20:32:01.2362564Z | File "/home/ec2-user/actions-runner/_work/FBGEMM/FBGEMM/fbgemm_gpu/experimental/gen_ai/test/moe/activation_test.py", line 92, in test_silu_mul_quant 2025-05-07T20:32:01.2363322Z | x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:01.2365312Z | torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 20.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 8.44 MiB is free. Including non-PyTorch memory, this process has 22.05 GiB memory in use. Of the allocated memory 21.77 GiB is allocated by PyTorch, and 2.12 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables) 2025-05-07T20:32:01.2367214Z | Falsifying example: test_silu_mul_quant( 2025-05-07T20:32:01.2367645Z | self=, 2025-05-07T20:32:01.2368046Z | T=128, 2025-05-07T20:32:01.2368237Z | D=7168, 2025-05-07T20:32:01.2368445Z | scale_ub=None, 2025-05-07T20:32:01.2368682Z | contiguous=True, 2025-05-07T20:32:01.2368912Z | compiled=True, 2025-05-07T20:32:01.2369131Z | ) 2025-05-07T20:32:01.2369309Z | 2025-05-07T20:32:01.2369812Z | You can reproduce this example by temporarily adding @reproduce_failure('6.131.14', b'AEEBQQFBAEEAQQA=') as a decorator on your test case 2025-05-07T20:32:01.2370401Z +---------------- 3 ---------------- 2025-05-07T20:32:01.2370694Z | Traceback (most recent call last): 2025-05-07T20:32:01.2371388Z | File "/home/ec2-user/actions-runner/_work/FBGEMM/FBGEMM/fbgemm_gpu/experimental/gen_ai/test/moe/activation_test.py", line 92, in test_silu_mul_quant 2025-05-07T20:32:01.2372135Z | x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:01.2374282Z | torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 20.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 8.44 MiB is free. Including non-PyTorch memory, this process has 22.05 GiB memory in use. Of the allocated memory 21.77 GiB is allocated by PyTorch, and 2.12 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. 
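Each falsifying example above comes with a Hypothesis blob for deterministic replay. A sketch of how the decorator from sub-exception 1 would be applied, with the version and blob copied verbatim from the log (it sits temporarily on top of the existing @given test, as the log instructs):

    from hypothesis import reproduce_failure

    # Replays exactly T=128, D=7168, scale_ub=1200.0, contiguous=True, compiled=False.
    @reproduce_failure('6.131.14', b'AEEBQQFBAUEAQQE=')
    def test_silu_mul_quant(self, T, D, scale_ub, contiguous, compiled):
        ...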
See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables) 2025-05-07T20:32:01.2376540Z | Falsifying example: test_silu_mul_quant( 2025-05-07T20:32:01.2377142Z | self=, 2025-05-07T20:32:01.2377685Z | T=128, 2025-05-07T20:32:01.2377946Z | D=5120, 2025-05-07T20:32:01.2378230Z | scale_ub=1200.0, 2025-05-07T20:32:01.2378675Z | contiguous=True, 2025-05-07T20:32:01.2378990Z | compiled=True, 2025-05-07T20:32:01.2379300Z | ) 2025-05-07T20:32:01.2379540Z | 2025-05-07T20:32:01.2380227Z | You can reproduce this example by temporarily adding @reproduce_failure('6.131.14', b'AEEBQQBBAUEAQQA=') as a decorator on your test case 2025-05-07T20:32:01.2381044Z +---------------- 4 ---------------- 2025-05-07T20:32:01.2381438Z | Traceback (most recent call last): 2025-05-07T20:32:01.2382402Z | File "/home/ec2-user/actions-runner/_work/FBGEMM/FBGEMM/fbgemm_gpu/experimental/gen_ai/test/moe/activation_test.py", line 126, in test_silu_mul_quant 2025-05-07T20:32:01.2383371Z | y_fp8_ref, y_scale_ref = ref_fn() 2025-05-07T20:32:01.2383852Z | ~~~~~~^^ 2025-05-07T20:32:01.2384723Z | File "/home/ec2-user/actions-runner/_work/FBGEMM/FBGEMM/fbgemm_gpu/experimental/gen_ai/test/moe/activation_test.py", line 124, in ref_fn 2025-05-07T20:32:01.2385652Z | return triton_quantize_fp8_row(y, scale_ub_tensor) 2025-05-07T20:32:01.2386772Z | File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/fbgemm_gpu/experimental/gemm/triton_gemm/fp8_gemm.py", line 2370, in triton_quantize_fp8_row 2025-05-07T20:32:01.2387850Z | _kernel_quantize_fp8_row[grid]( 2025-05-07T20:32:01.2388228Z | ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~^ 2025-05-07T20:32:01.2388568Z | a, 2025-05-07T20:32:01.2388835Z | ^^ 2025-05-07T20:32:01.2389116Z | ...<23 lines>... 
2025-05-07T20:32:01.2389436Z | USE_INT64=use_int64, 2025-05-07T20:32:01.2389791Z | ^^^^^^^^^^^^^^^^^^^^ 2025-05-07T20:32:01.2390118Z | ) 2025-05-07T20:32:01.2390356Z | ^ 2025-05-07T20:32:01.2391054Z | File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/jit.py", line 330, in 2025-05-07T20:32:01.2392061Z | return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:01.2392672Z | ~~~~~~~~^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ 2025-05-07T20:32:01.2393531Z | File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/autotuner.py", line 186, in run 2025-05-07T20:32:01.2394573Z | timings = {config: self._bench(*args, config=config, **kwargs) for config in pruned_configs} 2025-05-07T20:32:01.2395211Z | ~~~~~~~~~~~^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ 2025-05-07T20:32:01.2396070Z | File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/autotuner.py", line 166, in _bench 2025-05-07T20:32:01.2396966Z | return self.do_bench(kernel_call, quantiles=(0.5, 0.2, 0.8)) 2025-05-07T20:32:01.2397352Z | ~~~~~~~~~~~~~^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ 2025-05-07T20:32:01.2397955Z | File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/testing.py", line 117, in do_bench 2025-05-07T20:32:01.2399104Z | fn() 2025-05-07T20:32:01.2399302Z | ~~^^ 2025-05-07T20:32:01.2399863Z | File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/autotuner.py", line 152, in kernel_call 2025-05-07T20:32:01.2400490Z | self.fn.run( 2025-05-07T20:32:01.2400738Z | ~~~~~~~~~~~^ 2025-05-07T20:32:01.2401028Z | *args, 2025-05-07T20:32:01.2401243Z | ^^^^^^ 2025-05-07T20:32:01.2401455Z | **current, 2025-05-07T20:32:01.2401681Z | ^^^^^^^^^^ 2025-05-07T20:32:01.2401903Z | ) 2025-05-07T20:32:01.2402086Z | ^ 2025-05-07T20:32:01.2402590Z | File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/jit.py", line 623, in run 2025-05-07T20:32:01.2403165Z | kernel = self.compile( 2025-05-07T20:32:01.2403420Z | src, 2025-05-07T20:32:01.2403630Z | target=target, 2025-05-07T20:32:01.2404090Z | options=options.__dict__, 2025-05-07T20:32:01.2404361Z | ) 2025-05-07T20:32:01.2404896Z | File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py", line 273, in compile 2025-05-07T20:32:01.2405647Z | module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:01.2406347Z | File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py", line 100, in make_ir 2025-05-07T20:32:01.2407120Z | return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:01.2407587Z | module_map=module_map) 2025-05-07T20:32:01.2408074Z | triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:01.2408431Z | def _kernel_quantize_fp8_row( 2025-05-07T20:32:01.2408693Z | ^ 2025-05-07T20:32:01.2409149Z | ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:01.2409716Z | Falsifying example: test_silu_mul_quant( 2025-05-07T20:32:01.2410108Z | # The test always failed when commented parts were varied together. 
2025-05-07T20:32:01.2410624Z | self=, 2025-05-07T20:32:01.2411064Z | T=1, # or any other generated value 2025-05-07T20:32:01.2411377Z | D=5120, # or any other generated value 2025-05-07T20:32:01.2411747Z | scale_ub=None, # or any other generated value 2025-05-07T20:32:01.2412235Z | contiguous=True, # or any other generated value 2025-05-07T20:32:01.2412726Z | compiled=True, # or any other generated value 2025-05-07T20:32:01.2413129Z | ) 2025-05-07T20:32:01.2413373Z | 2025-05-07T20:32:01.2414196Z | You can reproduce this example by temporarily adding @reproduce_failure('6.131.14', b'AEEAQQBBAEEAQQA=') as a decorator on your test case 2025-05-07T20:32:01.2415016Z +------------------------------------ 2025-05-07T20:32:01.2415508Z ---------------------------------- Hypothesis ---------------------------------- 2025-05-07T20:32:01.2416019Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:01.2416599Z self=, 2025-05-07T20:32:01.2417149Z T=1, 2025-05-07T20:32:01.2417399Z D=5120, 2025-05-07T20:32:01.2417670Z scale_ub=None, 2025-05-07T20:32:01.2417973Z contiguous=True, 2025-05-07T20:32:01.2418277Z compiled=True, 2025-05-07T20:32:01.2418562Z ) 2025-05-07T20:32:01.2419000Z self = 2025-05-07T20:32:01.2419665Z T = 1, D = 5120, scale_ub = None, contiguous = True, compiled = True 2025-05-07T20:32:01.2420020Z 2025-05-07T20:32:01.2420130Z @given( 2025-05-07T20:32:01.2420450Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:01.2420881Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:01.2444639Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:01.2445142Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:01.2445612Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:01.2446017Z ) 2025-05-07T20:32:01.2446490Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:01.2447100Z def test_silu_mul_quant( 2025-05-07T20:32:01.2447435Z self, 2025-05-07T20:32:01.2447705Z T: int, 2025-05-07T20:32:01.2447972Z D: int, 2025-05-07T20:32:01.2448276Z scale_ub: Optional[float], 2025-05-07T20:32:01.2448654Z contiguous: bool, 2025-05-07T20:32:01.2448983Z compiled: bool, 2025-05-07T20:32:01.2449312Z ) -> None: 2025-05-07T20:32:01.2449608Z torch.manual_seed(2025) 2025-05-07T20:32:01.2449919Z 2025-05-07T20:32:01.2450288Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:01.2450781Z 2025-05-07T20:32:01.2451285Z x_sign = torch.sign(x) 2025-05-07T20:32:01.2451697Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:01.2452129Z x = x_sign * x_clamp 2025-05-07T20:32:01.2452464Z x0 = x[:, :D] 2025-05-07T20:32:01.2452760Z x1 = x[:, D:] 2025-05-07T20:32:01.2453052Z 2025-05-07T20:32:01.2453310Z if contiguous: 2025-05-07T20:32:01.2453799Z x0 = x0.contiguous() 2025-05-07T20:32:01.2454168Z x1 = x1.contiguous() 2025-05-07T20:32:01.2454496Z 2025-05-07T20:32:01.2454763Z if scale_ub is not None: 2025-05-07T20:32:01.2455142Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:01.2455608Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:01.2456148Z ) 2025-05-07T20:32:01.2456427Z else: 2025-05-07T20:32:01.2456726Z scale_ub_tensor = None 2025-05-07T20:32:01.2457069Z 2025-05-07T20:32:01.2457398Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:01.2457849Z op = silu_mul_quant 2025-05-07T20:32:01.2458187Z if compiled: 2025-05-07T20:32:01.2458528Z op = torch.compile(op) 2025-05-07T20:32:01.2458916Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:01.2459276Z 2025-05-07T20:32:01.2459539Z 
y_fp8, y_scale = fn() 2025-05-07T20:32:01.2459907Z y = y_fp8.to(torch.float32) * y_scale[:, None] 2025-05-07T20:32:01.2460277Z 2025-05-07T20:32:01.2460585Z def ref_fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:01.2461020Z x0_fp32 = x0.to(torch.float32) 2025-05-07T20:32:01.2461409Z x1_fp32 = x1.to(torch.float32) 2025-05-07T20:32:01.2461835Z y = x0_fp32 * torch.sigmoid(x0_fp32) * x1_fp32 2025-05-07T20:32:01.2462299Z return triton_quantize_fp8_row(y, scale_ub_tensor) 2025-05-07T20:32:01.2462708Z 2025-05-07T20:32:01.2462983Z > y_fp8_ref, y_scale_ref = ref_fn() 2025-05-07T20:32:01.2463264Z 2025-05-07T20:32:01.2463414Z moe/activation_test.py:126: 2025-05-07T20:32:01.2463822Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:01.2464291Z moe/activation_test.py:124: in ref_fn 2025-05-07T20:32:01.2464710Z return triton_quantize_fp8_row(y, scale_ub_tensor) 2025-05-07T20:32:01.2465825Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/fbgemm_gpu/experimental/gemm/triton_gemm/fp8_gemm.py:2370: in triton_quantize_fp8_row 2025-05-07T20:32:01.2466834Z _kernel_quantize_fp8_row[grid]( 2025-05-07T20:32:01.2467527Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:01.2468432Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:01.2469306Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/autotuner.py:186: in run 2025-05-07T20:32:01.2470201Z timings = {config: self._bench(*args, config=config, **kwargs) for config in pruned_configs} 2025-05-07T20:32:01.2471113Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/autotuner.py:166: in _bench 2025-05-07T20:32:01.2471916Z return self.do_bench(kernel_call, quantiles=(0.5, 0.2, 0.8)) 2025-05-07T20:32:01.2472668Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/testing.py:117: in do_bench 2025-05-07T20:32:01.2473344Z fn() 2025-05-07T20:32:01.2473997Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/autotuner.py:152: in kernel_call 2025-05-07T20:32:01.2474771Z self.fn.run( 2025-05-07T20:32:01.2475389Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:01.2476079Z kernel = self.compile( 2025-05-07T20:32:01.2476769Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:01.2477723Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:01.2478237Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:01.2478534Z 2025-05-07T20:32:01.2478803Z self = 2025-05-07T20:32:01.2480202Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=2, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:01.2482089Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . 
at 0x7f93bd68d3a0>} 2025-05-07T20:32:01.2483823Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:01.2485138Z context = 2025-05-07T20:32:01.2485499Z 2025-05-07T20:32:01.2485716Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:01.2486381Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:01.2486995Z module_map=module_map) 2025-05-07T20:32:01.2487465Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:01.2487924Z E def _kernel_quantize_fp8_row( 2025-05-07T20:32:01.2488282Z E ^ 2025-05-07T20:32:01.2488889Z E ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:01.2489464Z 2025-05-07T20:32:01.2489980Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:01.2490604Z 2025-05-07T20:32:01.2490735Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:01.2491231Z self=, 2025-05-07T20:32:01.2491716Z T=2048, 2025-05-07T20:32:01.2491935Z D=5120, 2025-05-07T20:32:01.2492167Z scale_ub=1200.0, 2025-05-07T20:32:01.2492439Z contiguous=True, 2025-05-07T20:32:01.2492696Z compiled=False, 2025-05-07T20:32:01.2492943Z ) 2025-05-07T20:32:01.2493323Z self = 2025-05-07T20:32:01.2494063Z T = 2048, D = 5120, scale_ub = 1200.0, contiguous = True, compiled = False 2025-05-07T20:32:01.2494388Z 2025-05-07T20:32:01.2494480Z @given( 2025-05-07T20:32:01.2494756Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:01.2495168Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:01.2495536Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:01.2496172Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:01.2496607Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:01.2496973Z ) 2025-05-07T20:32:01.2497410Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:01.2497947Z def test_silu_mul_quant( 2025-05-07T20:32:01.2498513Z self, 2025-05-07T20:32:01.2498771Z T: int, 2025-05-07T20:32:01.2499005Z D: int, 2025-05-07T20:32:01.2499260Z scale_ub: Optional[float], 2025-05-07T20:32:01.2499577Z contiguous: bool, 2025-05-07T20:32:01.2499857Z compiled: bool, 2025-05-07T20:32:01.2500121Z ) -> None: 2025-05-07T20:32:01.2500366Z torch.manual_seed(2025) 2025-05-07T20:32:01.2500653Z 2025-05-07T20:32:01.2500977Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:01.2501396Z 2025-05-07T20:32:01.2501635Z x_sign = torch.sign(x) 2025-05-07T20:32:01.2501977Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:01.2502595Z x = x_sign * x_clamp 2025-05-07T20:32:01.2502886Z x0 = x[:, :D] 2025-05-07T20:32:01.2503147Z x1 = x[:, D:] 2025-05-07T20:32:01.2503393Z 2025-05-07T20:32:01.2503634Z if contiguous: 2025-05-07T20:32:01.2503933Z x0 = x0.contiguous() 2025-05-07T20:32:01.2504232Z x1 = x1.contiguous() 2025-05-07T20:32:01.2504515Z 2025-05-07T20:32:01.2504758Z if scale_ub is not None: 2025-05-07T20:32:01.2505117Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:01.2505551Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:01.2505931Z ) 2025-05-07T20:32:01.2506165Z else: 2025-05-07T20:32:01.2506410Z scale_ub_tensor = None 2025-05-07T20:32:01.2506869Z 2025-05-07T20:32:01.2507153Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:01.2507525Z op = silu_mul_quant 2025-05-07T20:32:01.2507843Z if compiled: 
2025-05-07T20:32:01.2508156Z op = torch.compile(op) 2025-05-07T20:32:01.2508508Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:01.2508845Z 2025-05-07T20:32:01.2509074Z > y_fp8, y_scale = fn() 2025-05-07T20:32:01.2509275Z 2025-05-07T20:32:01.2509391Z moe/activation_test.py:117: 2025-05-07T20:32:01.2509751Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:01.2510159Z moe/activation_test.py:115: in fn 2025-05-07T20:32:01.2510498Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:01.2511326Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:32:01.2512175Z _fbgemm_silu_mul_quant[grid]( 2025-05-07T20:32:01.2512862Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:01.2513701Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:01.2514547Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:01.2515238Z kernel = self.compile( 2025-05-07T20:32:01.2515892Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:01.2516674Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:01.2517150Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:01.2517439Z 2025-05-07T20:32:01.2517694Z self = 2025-05-07T20:32:01.2519075Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:01.2520830Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f93ce57e200>} 2025-05-07T20:32:01.2522550Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:01.2523862Z context = 2025-05-07T20:32:01.2524228Z 2025-05-07T20:32:01.2524446Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:01.2525115Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:01.2525739Z module_map=module_map) 2025-05-07T20:32:01.2526229Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:01.2526686Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:32:01.2527039Z E ^ 2025-05-07T20:32:01.2527749Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')")
2025-05-07T20:32:01.2614819Z 
2025-05-07T20:32:01.2615223Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:100: CompilationError
2025-05-07T20:32:01.2615732Z [identical test source and traceback repeated for each of the following examples; every one fails with CompilationError: ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')"); only the example parameters and the failing Triton kernel differ]
2025-05-07T20:32:01.2615833Z Trying example: test_silu_mul_quant(T=1, D=7168, scale_ub=None, contiguous=True, compiled=True) -> CompilationError in _kernel_quantize_fp8_row (ref_fn -> triton_quantize_fp8_row)
2025-05-07T20:32:01.2672651Z Trying example: test_silu_mul_quant(T=4096, D=5120, scale_ub=None, contiguous=False, compiled=False) -> CompilationError in _fbgemm_silu_mul_quant (fn -> silu_mul_quant)
2025-05-07T20:32:01.2689442Z Trying example: test_silu_mul_quant(T=4096, D=7168, scale_ub=None, contiguous=False, compiled=False) -> CompilationError in _fbgemm_silu_mul_quant (fn -> silu_mul_quant)
2025-05-07T20:32:01.2702478Z Trying example: test_silu_mul_quant(T=128, D=7168, scale_ub=None, contiguous=False, compiled=True) -> CompilationError in _kernel_quantize_fp8_row (ref_fn -> triton_quantize_fp8_row)
2025-05-07T20:32:01.2718385Z Trying example: test_silu_mul_quant(T=128, D=7168, scale_ub=None, contiguous=False, compiled=False) -> CompilationError in _fbgemm_silu_mul_quant (fn -> silu_mul_quant)
2025-05-07T20:32:01.2735786Z Trying example: test_silu_mul_quant(T=4096, D=5120, scale_ub=1200.0, contiguous=True, compiled=False) -> CompilationError in _fbgemm_silu_mul_quant (fn -> silu_mul_quant)
2025-05-07T20:32:01.2752587Z Trying example: test_silu_mul_quant(T=1, D=5120, scale_ub=None, contiguous=True, compiled=True) -> CompilationError in _kernel_quantize_fp8_row (ref_fn -> triton_quantize_fp8_row)
2025-05-07T20:32:01.2773864Z Trying example: test_silu_mul_quant(T=2048, D=5120, scale_ub=None, contiguous=True, compiled=True) -> CompilationError in _kernel_quantize_fp8_row (ref_fn -> triton_quantize_fp8_row)
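The architecture named in the error is the root cause: Triton's fp8e4nv is the CUDA E4M3 format, which (in the Triton build used here) is only enabled on NVIDIA GPUs of compute capability 8.9 (Ada) or newer; older parts expose only fp8e4b15 and fp8e5, exactly as the ValueError reports, so every kernel that writes a torch.float8_e4m3fn output fails at compile time and Hypothesis keeps re-drawing examples into the same failure. A minimal sketch of a capability guard such a test could use, with hypothetical helper names that are not part of the FBGEMM test suite:

    import pytest
    import torch

    def supports_fp8e4nv() -> bool:
        # fp8e4nv (E4M3) needs sm_89 or newer; earlier GPUs only get
        # fp8e4b15 / fp8e5, which is what the ValueError above lists.
        if not torch.cuda.is_available():
            return False
        return torch.cuda.get_device_capability() >= (8, 9)

    # Applied as a marker, this turns the repeated CompilationError into a skip:
    requires_fp8e4nv = pytest.mark.skipif(
        not supports_fp8e4nv(),
        reason="Triton fp8e4nv requires compute capability >= 8.9",
    )

With a marker like this on test_silu_mul_quant, a runner without E4M3 support would report one skip instead of one identical compilation failure per drawn example.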
2025-05-07T20:32:01.2802084Z Trying example: test_silu_mul_quant(T=128, D=5120, scale_ub=None, contiguous=True, compiled=True) -> fn() returns, then ref_fn() fails in triton_quantize_fp8_row (moe/activation_test.py:124 -> fp8_gemm.py:2370 -> _kernel_quantize_fp8_row[grid]):
2025-05-07T20:32:01.2817337Z E triton.compiler.errors.CompilationError: at 1:0:
2025-05-07T20:32:01.2817446Z E def _kernel_quantize_fp8_row(
2025-05-07T20:32:01.2817523Z E ^
2025-05-07T20:32:01.2817875Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:01.2817888Z 2025-05-07T20:32:01.2818298Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:100: CompilationError

Hypothesis kept drawing new examples for test_silu_mul_quant, and every retry below failed in the same Triton compile step; the test body is identical to the listing above in each retry, so only the drawn parameters and the first kernel reached are shown. Retries that fail in ref_fn (moe/activation_test.py:124) reach _kernel_quantize_fp8_row through triton_quantize_fp8_row (fp8_gemm.py:2370); retries that fail in fn (moe/activation_test.py:115) reach _fbgemm_silu_mul_quant through silu_mul_quant (activation.py:80).

Trying example: test_silu_mul_quant(T=4096, D=5120, scale_ub=None, contiguous=True, compiled=True) -> ref_fn(): _kernel_quantize_fp8_row
Trying example: test_silu_mul_quant(T=16384, D=5120, scale_ub=None, contiguous=True, compiled=True) -> ref_fn(): _kernel_quantize_fp8_row

Both retries ended in:
E   triton.compiler.errors.CompilationError: at 1:0:
E   def _kernel_quantize_fp8_row(
E   ^
E   ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')")
/home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:100: CompilationError
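The failure is an architecture mismatch rather than a bug in the kernel source: Triton's NVIDIA backend rejects the fp8e4nv (FP8 E4M3) dtype below compute capability 8.9, and the linux.g5.4xlarge.nvidia.gpu runner carries an A10G, which is SM 8.6, where only fp8e4b15 and fp8e5 are available. A minimal sketch of the capability probe involved, using standard PyTorch CUDA APIs (the helper name is illustrative, not part of FBGEMM):

    import torch

    def supports_fp8e4nv() -> bool:
        # fp8e4nv (torch.float8_e4m3fn) needs SM 8.9+ (Ada/Hopper).
        # On the A10G (SM 8.6) running this job it returns False.
        # Illustrative helper, not part of the FBGEMM API.
        if not torch.cuda.is_available():
            return False
        return torch.cuda.get_device_capability() >= (8, 9)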
Trying example: test_silu_mul_quant(T=1, D=5120, scale_ub=1200.0, contiguous=True, compiled=True) -> fn(): _fbgemm_silu_mul_quant
Trying example: test_silu_mul_quant(T=1, D=5120, scale_ub=None, contiguous=False, compiled=True) -> ref_fn(): _kernel_quantize_fp8_row
Trying example: test_silu_mul_quant(T=1, D=5120, scale_ub=None, contiguous=True, compiled=False) -> fn(): _fbgemm_silu_mul_quant
Trying example: test_silu_mul_quant(T=128, D=5120, scale_ub=None, contiguous=False, compiled=True) -> fn(): _fbgemm_silu_mul_quant

All four retries ended in:
E   triton.compiler.errors.CompilationError: at 1:0:
E   ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')")
/home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:100: CompilationError
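For orientation, both failing paths exercise the same contract: SiLU-and-multiply followed by row-wise FP8 quantization, i.e. y = x0 * sigmoid(x0) * x1 with each row scaled so its maximum magnitude fits the FP8 range, and the returned scale is the dequantization multiplier the test applies as y_fp8.to(torch.float32) * y_scale[:, None]. A minimal pure-PyTorch sketch of that contract follows; it is an illustration inferred from the test's ref_fn, not FBGEMM's triton_quantize_fp8_row implementation:

    from typing import Optional, Tuple

    import torch

    def silu_mul_quant_ref(
        x0: torch.Tensor, x1: torch.Tensor, scale_ub: Optional[torch.Tensor] = None
    ) -> Tuple[torch.Tensor, torch.Tensor]:
        # SiLU(x0) * x1 in fp32, as in the test's ref_fn.
        y = x0.float() * torch.sigmoid(x0.float()) * x1.float()
        # Per-row scale: row max |value| mapped onto the fp8 e4m3 max (448),
        # optionally capping the row max at scale_ub first.
        row_max = y.abs().amax(dim=-1)
        if scale_ub is not None:
            row_max = torch.minimum(row_max, scale_ub)
        fp8_max = torch.finfo(torch.float8_e4m3fn).max
        scale = row_max.clamp(min=1e-12) / fp8_max
        y_fp8 = (y / scale[:, None]).to(torch.float8_e4m3fn)
        return y_fp8, scale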
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:01.2905443Z 2025-05-07T20:32:01.2905845Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:01.2905854Z 2025-05-07T20:32:01.2905955Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:01.2906177Z self=, 2025-05-07T20:32:01.2906254Z T=128, 2025-05-07T20:32:01.2906333Z D=7168, 2025-05-07T20:32:01.2906415Z scale_ub=1200.0, 2025-05-07T20:32:01.2906501Z contiguous=False, 2025-05-07T20:32:01.2906595Z compiled=False, 2025-05-07T20:32:01.2906668Z ) 2025-05-07T20:32:01.2906880Z self = 2025-05-07T20:32:01.2907052Z T = 128, D = 7168, scale_ub = 1200.0, contiguous = False, compiled = False 2025-05-07T20:32:01.2907057Z 2025-05-07T20:32:01.2907133Z @given( 2025-05-07T20:32:01.2907257Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:01.2907361Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:01.2907474Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:01.2907598Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:01.2907714Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:01.2907785Z ) 2025-05-07T20:32:01.2908029Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:01.2908123Z def test_silu_mul_quant( 2025-05-07T20:32:01.2908196Z self, 2025-05-07T20:32:01.2908277Z T: int, 2025-05-07T20:32:01.2908352Z D: int, 2025-05-07T20:32:01.2908447Z scale_ub: Optional[float], 2025-05-07T20:32:01.2908543Z contiguous: bool, 2025-05-07T20:32:01.2908626Z compiled: bool, 2025-05-07T20:32:01.2908700Z ) -> None: 2025-05-07T20:32:01.2908797Z torch.manual_seed(2025) 2025-05-07T20:32:01.2908871Z 2025-05-07T20:32:01.2909046Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:01.2909117Z 2025-05-07T20:32:01.2909205Z x_sign = torch.sign(x) 2025-05-07T20:32:01.2909335Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:01.2909507Z x = x_sign * x_clamp 2025-05-07T20:32:01.2909585Z x0 = x[:, :D] 2025-05-07T20:32:01.2909669Z x1 = x[:, D:] 2025-05-07T20:32:01.2909740Z 2025-05-07T20:32:01.2909823Z if contiguous: 2025-05-07T20:32:01.2909918Z x0 = x0.contiguous() 2025-05-07T20:32:01.2910003Z x1 = x1.contiguous() 2025-05-07T20:32:01.2910073Z 2025-05-07T20:32:01.2910169Z if scale_ub is not None: 2025-05-07T20:32:01.2910272Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:01.2910408Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:01.2910481Z ) 2025-05-07T20:32:01.2910557Z else: 2025-05-07T20:32:01.2910730Z scale_ub_tensor = None 2025-05-07T20:32:01.2910801Z 2025-05-07T20:32:01.2910926Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:01.2911017Z op = silu_mul_quant 2025-05-07T20:32:01.2911098Z if compiled: 2025-05-07T20:32:01.2911199Z op = torch.compile(op) 2025-05-07T20:32:01.2911307Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:01.2911378Z 2025-05-07T20:32:01.2911465Z > y_fp8, y_scale = fn() 2025-05-07T20:32:01.2911476Z 2025-05-07T20:32:01.2911569Z moe/activation_test.py:117: 2025-05-07T20:32:01.2911694Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:01.2911794Z moe/activation_test.py:115: in fn 2025-05-07T20:32:01.2911889Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:01.2912375Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:32:01.2912481Z 
_fbgemm_silu_mul_quant[grid]( 2025-05-07T20:32:01.2912832Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:01.2913049Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:01.2913392Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:01.2913479Z kernel = self.compile( 2025-05-07T20:32:01.2913859Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:01.2914025Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:01.2914148Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:01.2914152Z 2025-05-07T20:32:01.2914357Z self = 2025-05-07T20:32:01.2915117Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:01.2915614Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f93a5c30220>} 2025-05-07T20:32:01.2916346Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:01.2916537Z context = 2025-05-07T20:32:01.2916541Z 2025-05-07T20:32:01.2916702Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:01.2916956Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:01.2917073Z module_map=module_map) 2025-05-07T20:32:01.2917231Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:01.2917328Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:32:01.2917491Z E ^ 2025-05-07T20:32:01.2917834Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:01.2917838Z 2025-05-07T20:32:01.2918245Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:01.2918249Z 2025-05-07T20:32:01.2918349Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:01.2918565Z self=, 2025-05-07T20:32:01.2918643Z T=128, 2025-05-07T20:32:01.2918717Z D=5120, 2025-05-07T20:32:01.2918796Z scale_ub=None, 2025-05-07T20:32:01.2918886Z contiguous=False, 2025-05-07T20:32:01.2918968Z compiled=False, 2025-05-07T20:32:01.2919116Z ) 2025-05-07T20:32:01.2919329Z self = 2025-05-07T20:32:01.2919495Z T = 128, D = 5120, scale_ub = None, contiguous = False, compiled = False 2025-05-07T20:32:01.2919505Z 2025-05-07T20:32:01.2919587Z @given( 2025-05-07T20:32:01.2919707Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:01.2919801Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:01.2919919Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:01.2920032Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:01.2920141Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:01.2920220Z ) 2025-05-07T20:32:01.2920459Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:01.2920555Z def test_silu_mul_quant( 2025-05-07T20:32:01.2920633Z self, 2025-05-07T20:32:01.2920705Z T: int, 2025-05-07T20:32:01.2920791Z D: int, 2025-05-07T20:32:01.2920885Z scale_ub: Optional[float], 2025-05-07T20:32:01.2920980Z contiguous: bool, 2025-05-07T20:32:01.2925857Z compiled: bool, 2025-05-07T20:32:01.2925955Z ) -> None: 2025-05-07T20:32:01.2926064Z torch.manual_seed(2025) 2025-05-07T20:32:01.2926147Z 2025-05-07T20:32:01.2926319Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:01.2926397Z 2025-05-07T20:32:01.2926499Z x_sign = torch.sign(x) 2025-05-07T20:32:01.2926624Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:01.2926715Z x = x_sign * x_clamp 2025-05-07T20:32:01.2926805Z x0 = x[:, :D] 2025-05-07T20:32:01.2926888Z x1 = x[:, D:] 2025-05-07T20:32:01.2926968Z 2025-05-07T20:32:01.2927055Z if contiguous: 2025-05-07T20:32:01.2927148Z x0 = x0.contiguous() 2025-05-07T20:32:01.2927244Z x1 = x1.contiguous() 2025-05-07T20:32:01.2927322Z 2025-05-07T20:32:01.2927419Z if scale_ub is not None: 2025-05-07T20:32:01.2927533Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:01.2927669Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:01.2927753Z ) 2025-05-07T20:32:01.2927841Z else: 2025-05-07T20:32:01.2927942Z scale_ub_tensor = None 2025-05-07T20:32:01.2928016Z 2025-05-07T20:32:01.2928158Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:01.2928253Z op = silu_mul_quant 2025-05-07T20:32:01.2928347Z if compiled: 2025-05-07T20:32:01.2928449Z op = torch.compile(op) 2025-05-07T20:32:01.2928557Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:01.2928637Z 2025-05-07T20:32:01.2928730Z > y_fp8, y_scale = fn() 2025-05-07T20:32:01.2928734Z 2025-05-07T20:32:01.2928833Z moe/activation_test.py:117: 2025-05-07T20:32:01.2928975Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:01.2929081Z moe/activation_test.py:115: in fn 2025-05-07T20:32:01.2929184Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:01.2929691Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:32:01.2929935Z 
_fbgemm_silu_mul_quant[grid]( 2025-05-07T20:32:01.2930297Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:01.2930520Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:01.2930857Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:01.2930964Z kernel = self.compile( 2025-05-07T20:32:01.2931343Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:01.2931595Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:01.2931733Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:01.2931737Z 2025-05-07T20:32:01.2931944Z self = 2025-05-07T20:32:01.2932727Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:01.2933227Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f93a4a4c900>} 2025-05-07T20:32:01.2934091Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:01.2934287Z context = 2025-05-07T20:32:01.2934291Z 2025-05-07T20:32:01.2934455Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:01.2934721Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:01.2934837Z module_map=module_map) 2025-05-07T20:32:01.2935009Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:01.2935110Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:32:01.2935188Z E ^ 2025-05-07T20:32:01.2935547Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:01.2935551Z 2025-05-07T20:32:01.2935960Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:01.2935964Z 2025-05-07T20:32:01.2936076Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:01.2936301Z self=, 2025-05-07T20:32:01.2936380Z T=128, 2025-05-07T20:32:01.2936467Z D=5120, 2025-05-07T20:32:01.2936553Z scale_ub=1200.0, 2025-05-07T20:32:01.2936643Z contiguous=True, 2025-05-07T20:32:01.2936736Z compiled=False, 2025-05-07T20:32:01.2936811Z ) 2025-05-07T20:32:01.2937029Z self = 2025-05-07T20:32:01.2937205Z T = 128, D = 5120, scale_ub = 1200.0, contiguous = True, compiled = False 2025-05-07T20:32:01.2937209Z 2025-05-07T20:32:01.2937288Z @given( 2025-05-07T20:32:01.2937409Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:01.2937521Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:01.2937637Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:01.2937764Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:01.2937884Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:01.2937960Z ) 2025-05-07T20:32:01.2938213Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:01.2938309Z def test_silu_mul_quant( 2025-05-07T20:32:01.2938475Z self, 2025-05-07T20:32:01.2938563Z T: int, 2025-05-07T20:32:01.2938641Z D: int, 2025-05-07T20:32:01.2938741Z scale_ub: Optional[float], 2025-05-07T20:32:01.2938837Z contiguous: bool, 2025-05-07T20:32:01.2938924Z compiled: bool, 2025-05-07T20:32:01.2939010Z ) -> None: 2025-05-07T20:32:01.2939107Z torch.manual_seed(2025) 2025-05-07T20:32:01.2939182Z 2025-05-07T20:32:01.2939355Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:01.2939435Z 2025-05-07T20:32:01.2939527Z x_sign = torch.sign(x) 2025-05-07T20:32:01.2939651Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:01.2939746Z x = x_sign * x_clamp 2025-05-07T20:32:01.2939904Z x0 = x[:, :D] 2025-05-07T20:32:01.2939988Z x1 = x[:, D:] 2025-05-07T20:32:01.2940068Z 2025-05-07T20:32:01.2940153Z if contiguous: 2025-05-07T20:32:01.2940246Z x0 = x0.contiguous() 2025-05-07T20:32:01.2940348Z x1 = x1.contiguous() 2025-05-07T20:32:01.2940422Z 2025-05-07T20:32:01.2940513Z if scale_ub is not None: 2025-05-07T20:32:01.2940630Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:01.2940766Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:01.2940843Z ) 2025-05-07T20:32:01.2940927Z else: 2025-05-07T20:32:01.2941022Z scale_ub_tensor = None 2025-05-07T20:32:01.2941095Z 2025-05-07T20:32:01.2941232Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:01.2941323Z op = silu_mul_quant 2025-05-07T20:32:01.2941416Z if compiled: 2025-05-07T20:32:01.2941517Z op = torch.compile(op) 2025-05-07T20:32:01.2941631Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:01.2941712Z 2025-05-07T20:32:01.2941805Z > y_fp8, y_scale = fn() 2025-05-07T20:32:01.2941809Z 2025-05-07T20:32:01.2941911Z moe/activation_test.py:117: 2025-05-07T20:32:01.2942052Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:01.2942154Z moe/activation_test.py:115: in fn 2025-05-07T20:32:01.2942253Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:01.2942749Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:32:01.2942848Z 
_fbgemm_silu_mul_quant[grid](
/home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/jit.py:330: in <lambda>
    return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs)
/home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/jit.py:623: in run
    kernel = self.compile(
/home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:273: in compile
    module = src.make_ir(options, codegen_fns, module_map, context)
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _

self = <triton.compiler.compiler.ASTSource object at 0x...>
options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0, ...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True)
codegen_fns = {'convert_custom_types': <...>, 'min_dot_size': <function ... at 0x7f93a4c39ee0>}
module_map = {'triton.language.extra.libdevice': <module ...>}
context = <triton._C.libtriton.ir.context object at 0x...>

    def make_ir(self, options, codegen_fns, module_map, context):
>       return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns,
                           module_map=module_map)
E   triton.compiler.errors.CompilationError: at 1:0:
E   def _fbgemm_silu_mul_quant(
E   ^
E   ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')")

/home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:100: CompilationError

Trying example: test_silu_mul_quant(
    self=<...>,
    T=1,
    D=7168,
    scale_ub=1200.0,
    contiguous=True,
    compiled=True,
)
self = <...>
T = 1, D = 7168, scale_ub = 1200.0, contiguous = True, compiled = True

    @given(
        T=st.sampled_from([1, 128, 2048, 4096, 16384]),
        D=st.sampled_from([5120, 7168]),
        scale_ub=st.sampled_from([None, 1200.00]),
        contiguous=st.sampled_from([True, False]),
        compiled=st.sampled_from([True, False]),
    )
    @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None)
    def test_silu_mul_quant(
        self,
        T: int,
        D: int,
        scale_ub: Optional[float],
        contiguous: bool,
        compiled: bool,
    ) -> None:
        torch.manual_seed(2025)

        x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16)

        x_sign = torch.sign(x)
        x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0)
        x = x_sign * x_clamp
        x0 = x[:, :D]
        x1 = x[:, D:]

        if contiguous:
            x0 = x0.contiguous()
            x1 = x1.contiguous()

        if scale_ub is not None:
            scale_ub_tensor = torch.tensor(
                [scale_ub], device="cuda", dtype=torch.float32
            )
        else:
            scale_ub_tensor = None

        def fn() -> Tuple[torch.Tensor, torch.Tensor]:
            op = silu_mul_quant
            if compiled:
                op = torch.compile(op)
            return op(x0, x1, scale_ub_tensor)

>       y_fp8, y_scale = fn()

moe/activation_test.py:117:
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _
moe/activation_test.py:115: in fn
    return op(x0, x1, scale_ub_tensor)
/home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/torch/_dynamo/eval_frame.py:678: in _fn
    return fn(*args, **kwargs)
/home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant
    _fbgemm_silu_mul_quant[grid](
/home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/jit.py:330: in <lambda>
    return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs)
/home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/jit.py:623: in run
    kernel = self.compile(
/home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:273: in compile
    module = src.make_ir(options, codegen_fns, module_map, context)
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _

    def make_ir(self, options, codegen_fns, module_map, context):
>       return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns,
                           module_map=module_map)
E   triton.compiler.errors.CompilationError: at 1:0:
E   def _fbgemm_silu_mul_quant(
E   ^
E   ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')")

/home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:100: CompilationError
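Every frame above points at the same root cause: Triton cannot lower the fp8e4nv (FP8 E4M3) dtype on this GPU, and per the ValueError the architecture only offers fp8e4b15 and fp8e5, which matches a pre-Ada part such as an A10G (sm_86) rather than an sm_89+ card. A guard along the following lines (hypothetical, not part of moe/activation_test.py; the (8, 9) threshold is an assumption about where fp8e4nv support begins) would let the suite skip rather than error on such runners:

    import torch

    def _supports_fp8e4nv() -> bool:
        # Assumption: fp8e4nv lowering requires compute capability >= 8.9;
        # an sm_86 part only gets fp8e4b15/fp8e5, exactly as the error reports.
        if not torch.cuda.is_available():
            return False
        return torch.cuda.get_device_capability() >= (8, 9)

    # Usage sketch, on the test method or class:
    # @unittest.skipUnless(_supports_fp8e4nv(), "fp8e4nv not supported on this GPU")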
Trying example: test_silu_mul_quant(
    self=<...>,
    T=1,
    D=7168,
    scale_ub=1200.0,
    contiguous=False,
    compiled=True,
)
[test body, traceback, and error identical to the first example above:
 CompilationError in _fbgemm_silu_mul_quant, ValueError("type fp8e4nv not supported
 in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')")]
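Judging from the test's reference function (shown in the next example below), silu_mul_quant fuses SiLU(x0) * x1 with rowwise FP8 quantization, so the unquantized part reduces to a few lines of eager PyTorch. A minimal sketch of that activation, with a hypothetical name and under the assumption that it mirrors ref_fn exactly:

    import torch

    def silu_mul_ref(x0: torch.Tensor, x1: torch.Tensor) -> torch.Tensor:
        # SiLU(x0) * x1, upcast to fp32 the same way the test's ref_fn does.
        x0_fp32 = x0.to(torch.float32)
        x1_fp32 = x1.to(torch.float32)
        return x0_fp32 * torch.sigmoid(x0_fp32) * x1_fp32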
Trying example: test_silu_mul_quant(
    self=<...>,
    T=1,
    D=7168,
    scale_ub=None,
    contiguous=False,
    compiled=True,
)
self = <...>
T = 1, D = 7168, scale_ub = None, contiguous = False, compiled = True

    [test setup identical to the first example above]

        def fn() -> Tuple[torch.Tensor, torch.Tensor]:
            op = silu_mul_quant
            if compiled:
                op = torch.compile(op)
            return op(x0, x1, scale_ub_tensor)

        y_fp8, y_scale = fn()
        y = y_fp8.to(torch.float32) * y_scale[:, None]

        def ref_fn() -> Tuple[torch.Tensor, torch.Tensor]:
            x0_fp32 = x0.to(torch.float32)
            x1_fp32 = x1.to(torch.float32)
            y = x0_fp32 * torch.sigmoid(x0_fp32) * x1_fp32
            return triton_quantize_fp8_row(y, scale_ub_tensor)

>       y_fp8_ref, y_scale_ref = ref_fn()

moe/activation_test.py:126:
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _
moe/activation_test.py:124: in ref_fn
    return triton_quantize_fp8_row(y, scale_ub_tensor)
/home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/fbgemm_gpu/experimental/gemm/triton_gemm/fp8_gemm.py:2370: in triton_quantize_fp8_row
    _kernel_quantize_fp8_row[grid](
/home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/jit.py:330: in <lambda>
    return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs)
/home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/autotuner.py:186: in run
    timings = {config: self._bench(*args, config=config, **kwargs) for config in pruned_configs}
/home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/autotuner.py:166: in _bench
    return self.do_bench(kernel_call, quantiles=(0.5, 0.2, 0.8))
/home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/testing.py:117: in do_bench
    fn()
/home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/autotuner.py:152: in kernel_call
    self.fn.run(
/home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/jit.py:623: in run
    kernel = self.compile(
/home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:273: in compile
    module = src.make_ir(options, codegen_fns, module_map, context)
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _

    def make_ir(self, options, codegen_fns, module_map, context):
>       return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns,
                           module_map=module_map)
E   triton.compiler.errors.CompilationError: at 1:0:
E   def _kernel_quantize_fp8_row(
E   ^
E   ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')")

/home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:100: CompilationError
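This example is the one variant in the batch: fn() itself returned, and the failure moved into the test's reference path, where triton_quantize_fp8_row compiles its own kernel, _kernel_quantize_fp8_row, and trips over the identical fp8e4nv restriction. The test dequantizes with y_fp8.to(torch.float32) * y_scale[:, None], so y_scale is a per-row dequantization scale; a plain-PyTorch rowwise quantizer consistent with that contract might look as follows (a sketch built on that assumption, not fbgemm_gpu's actual implementation):

    import torch

    # Assumed target dtype; finfo gives its representable max (448.0 for e4m3fn).
    FP8_MAX = torch.finfo(torch.float8_e4m3fn).max

    def quantize_fp8_row_ref(y: torch.Tensor, scale_ub: torch.Tensor | None = None):
        # Per-row max magnitude, optionally clamped by the scale upper bound.
        row_max = y.abs().amax(dim=1).float()
        if scale_ub is not None:
            row_max = torch.minimum(row_max, scale_ub)
        # Dequant scale chosen so the largest row entry maps to FP8_MAX.
        y_scale = torch.clamp(row_max, min=1e-12) / FP8_MAX
        y_fp8 = (y.float() / y_scale[:, None]).to(torch.float8_e4m3fn)
        return y_fp8, y_scale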
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:01.2991015Z 2025-05-07T20:32:01.2991429Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:01.2991433Z 2025-05-07T20:32:01.2991546Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:01.2991774Z self=, 2025-05-07T20:32:01.2991854Z T=1, 2025-05-07T20:32:01.2991941Z D=5120, 2025-05-07T20:32:01.2992026Z scale_ub=1200.0, 2025-05-07T20:32:01.2992113Z contiguous=False, 2025-05-07T20:32:01.2992204Z compiled=True, 2025-05-07T20:32:01.2992278Z ) 2025-05-07T20:32:01.2992492Z self = 2025-05-07T20:32:01.2992662Z T = 1, D = 5120, scale_ub = 1200.0, contiguous = False, compiled = True 2025-05-07T20:32:01.2992666Z 2025-05-07T20:32:01.2992745Z @given( 2025-05-07T20:32:01.2992865Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:01.2992976Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:01.2993092Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:01.2993214Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:01.2993328Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:01.2993414Z ) 2025-05-07T20:32:01.2993662Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:01.2993758Z def test_silu_mul_quant( 2025-05-07T20:32:01.2993836Z self, 2025-05-07T20:32:01.2993921Z T: int, 2025-05-07T20:32:01.2993999Z D: int, 2025-05-07T20:32:01.2994099Z scale_ub: Optional[float], 2025-05-07T20:32:01.2994195Z contiguous: bool, 2025-05-07T20:32:01.2994282Z compiled: bool, 2025-05-07T20:32:01.2994370Z ) -> None: 2025-05-07T20:32:01.2994466Z torch.manual_seed(2025) 2025-05-07T20:32:01.2994542Z 2025-05-07T20:32:01.2994714Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:01.2994793Z 2025-05-07T20:32:01.2994888Z x_sign = torch.sign(x) 2025-05-07T20:32:01.2995018Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:01.2995106Z x = x_sign * x_clamp 2025-05-07T20:32:01.2995271Z x0 = x[:, :D] 2025-05-07T20:32:01.2995357Z x1 = x[:, D:] 2025-05-07T20:32:01.2995432Z 2025-05-07T20:32:01.2995516Z if contiguous: 2025-05-07T20:32:01.2995616Z x0 = x0.contiguous() 2025-05-07T20:32:01.2995709Z x1 = x1.contiguous() 2025-05-07T20:32:01.2995781Z 2025-05-07T20:32:01.2995879Z if scale_ub is not None: 2025-05-07T20:32:01.2995986Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:01.2996125Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:01.2996202Z ) 2025-05-07T20:32:01.2996280Z else: 2025-05-07T20:32:01.2996379Z scale_ub_tensor = None 2025-05-07T20:32:01.2996451Z 2025-05-07T20:32:01.2996665Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:01.2996761Z op = silu_mul_quant 2025-05-07T20:32:01.2996845Z if compiled: 2025-05-07T20:32:01.2996946Z op = torch.compile(op) 2025-05-07T20:32:01.2997063Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:01.2997135Z 2025-05-07T20:32:01.2997226Z > y_fp8, y_scale = fn() 2025-05-07T20:32:01.2997235Z 2025-05-07T20:32:01.2997330Z moe/activation_test.py:117: 2025-05-07T20:32:01.2997459Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:01.2997565Z moe/activation_test.py:115: in fn 2025-05-07T20:32:01.2997663Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:01.2998024Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/torch/_dynamo/eval_frame.py:678: in _fn 2025-05-07T20:32:01.2998124Z return fn(*args, **kwargs) 
2025-05-07T20:32:01.2998953Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:32:01.2999058Z _fbgemm_silu_mul_quant[grid]( 2025-05-07T20:32:01.2999421Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:01.2999646Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:01.2999990Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:01.3000083Z kernel = self.compile( 2025-05-07T20:32:01.3000462Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:01.3000640Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:01.3000770Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:01.3000775Z 2025-05-07T20:32:01.3000983Z self = 2025-05-07T20:32:01.3001744Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:01.3002243Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f93a4867920>} 2025-05-07T20:32:01.3002979Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:01.3003168Z context = 2025-05-07T20:32:01.3003173Z 2025-05-07T20:32:01.3003342Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:01.3003603Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:01.3003712Z module_map=module_map) 2025-05-07T20:32:01.3003881Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:01.3004139Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:32:01.3004220Z E ^ 2025-05-07T20:32:01.3004567Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:01.3004571Z 2025-05-07T20:32:01.3004977Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:01.3004982Z 2025-05-07T20:32:01.3005091Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:01.3005333Z self=, 2025-05-07T20:32:01.3005418Z T=1, 2025-05-07T20:32:01.3005510Z D=5120, 2025-05-07T20:32:01.3005597Z scale_ub=1200.0, 2025-05-07T20:32:01.3005795Z contiguous=False, 2025-05-07T20:32:01.3005881Z compiled=False, 2025-05-07T20:32:01.3005953Z ) 2025-05-07T20:32:01.3006173Z self = 2025-05-07T20:32:01.3006342Z T = 1, D = 5120, scale_ub = 1200.0, contiguous = False, compiled = False 2025-05-07T20:32:01.3006347Z 2025-05-07T20:32:01.3006422Z @given( 2025-05-07T20:32:01.3006549Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:01.3006645Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:01.3006756Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:01.3006876Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:01.3006988Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:01.3007070Z ) 2025-05-07T20:32:01.3007310Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:01.3007403Z def test_silu_mul_quant( 2025-05-07T20:32:01.3007486Z self, 2025-05-07T20:32:01.3007563Z T: int, 2025-05-07T20:32:01.3007639Z D: int, 2025-05-07T20:32:01.3007741Z scale_ub: Optional[float], 2025-05-07T20:32:01.3007830Z contiguous: bool, 2025-05-07T20:32:01.3007918Z compiled: bool, 2025-05-07T20:32:01.3007999Z ) -> None: 2025-05-07T20:32:01.3008093Z torch.manual_seed(2025) 2025-05-07T20:32:01.3008162Z 2025-05-07T20:32:01.3008334Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:01.3008406Z 2025-05-07T20:32:01.3008502Z x_sign = torch.sign(x) 2025-05-07T20:32:01.3008625Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:01.3008713Z x = x_sign * x_clamp 2025-05-07T20:32:01.3008798Z x0 = x[:, :D] 2025-05-07T20:32:01.3008876Z x1 = x[:, D:] 2025-05-07T20:32:01.3008949Z 2025-05-07T20:32:01.3009037Z if contiguous: 2025-05-07T20:32:01.3009127Z x0 = x0.contiguous() 2025-05-07T20:32:01.3009221Z x1 = x1.contiguous() 2025-05-07T20:32:01.3009298Z 2025-05-07T20:32:01.3009387Z if scale_ub is not None: 2025-05-07T20:32:01.3009493Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:01.3009635Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:01.3009711Z ) 2025-05-07T20:32:01.3009791Z else: 2025-05-07T20:32:01.3009886Z scale_ub_tensor = None 2025-05-07T20:32:01.3009958Z 2025-05-07T20:32:01.3010090Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:01.3010179Z op = silu_mul_quant 2025-05-07T20:32:01.3010261Z if compiled: 2025-05-07T20:32:01.3010364Z op = torch.compile(op) 2025-05-07T20:32:01.3010471Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:01.3010543Z 2025-05-07T20:32:01.3010640Z > y_fp8, y_scale = fn() 2025-05-07T20:32:01.3010644Z 2025-05-07T20:32:01.3010739Z moe/activation_test.py:117: 2025-05-07T20:32:01.3010875Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:01.3010973Z moe/activation_test.py:115: in fn 2025-05-07T20:32:01.3011070Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:01.3011652Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:32:01.3011748Z 
_fbgemm_silu_mul_quant[grid]( 2025-05-07T20:32:01.3012101Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:01.3012324Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:01.3012658Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:01.3012752Z kernel = self.compile( 2025-05-07T20:32:01.3013204Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:01.3013378Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:01.3013508Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:01.3013518Z 2025-05-07T20:32:01.3013788Z self = 2025-05-07T20:32:01.3014552Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:01.3015048Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f93a4d72b60>} 2025-05-07T20:32:01.3015832Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:01.3016024Z context = 2025-05-07T20:32:01.3016029Z 2025-05-07T20:32:01.3016193Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:01.3016461Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:01.3016569Z module_map=module_map) 2025-05-07T20:32:01.3016731Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:01.3016832Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:32:01.3016906Z E ^ 2025-05-07T20:32:01.3017254Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:01.3017263Z 2025-05-07T20:32:01.3017674Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:01.3017679Z 2025-05-07T20:32:01.3017781Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:01.3018005Z self=, 2025-05-07T20:32:01.3018089Z T=16384, 2025-05-07T20:32:01.3018167Z D=5120, 2025-05-07T20:32:01.3018254Z scale_ub=1200.0, 2025-05-07T20:32:01.3018339Z contiguous=False, 2025-05-07T20:32:01.3018420Z compiled=True, 2025-05-07T20:32:01.3018498Z ) 2025-05-07T20:32:01.3018714Z self = 2025-05-07T20:32:01.3018890Z T = 16384, D = 5120, scale_ub = 1200.0, contiguous = False, compiled = True 2025-05-07T20:32:01.3018895Z 2025-05-07T20:32:01.3018972Z @given( 2025-05-07T20:32:01.3019091Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:01.3019193Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:01.3019305Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:01.3019426Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:01.3019540Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:01.3019614Z ) 2025-05-07T20:32:01.3019855Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:01.3020067Z def test_silu_mul_quant( 2025-05-07T20:32:01.3020143Z self, 2025-05-07T20:32:01.3020224Z T: int, 2025-05-07T20:32:01.3020302Z D: int, 2025-05-07T20:32:01.3020400Z scale_ub: Optional[float], 2025-05-07T20:32:01.3020491Z contiguous: bool, 2025-05-07T20:32:01.3020576Z compiled: bool, 2025-05-07T20:32:01.3020654Z ) -> None: 2025-05-07T20:32:01.3020752Z torch.manual_seed(2025) 2025-05-07T20:32:01.3020824Z 2025-05-07T20:32:01.3020989Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:01.3021065Z 2025-05-07T20:32:01.3021155Z x_sign = torch.sign(x) 2025-05-07T20:32:01.3021353Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:01.3021447Z x = x_sign * x_clamp 2025-05-07T20:32:01.3021525Z x0 = x[:, :D] 2025-05-07T20:32:01.3021605Z x1 = x[:, D:] 2025-05-07T20:32:01.3021678Z 2025-05-07T20:32:01.3021766Z if contiguous: 2025-05-07T20:32:01.3021863Z x0 = x0.contiguous() 2025-05-07T20:32:01.3021952Z x1 = x1.contiguous() 2025-05-07T20:32:01.3022024Z 2025-05-07T20:32:01.3022121Z if scale_ub is not None: 2025-05-07T20:32:01.3022225Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:01.3022358Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:01.3022438Z ) 2025-05-07T20:32:01.3022514Z else: 2025-05-07T20:32:01.3022606Z scale_ub_tensor = None 2025-05-07T20:32:01.3022685Z 2025-05-07T20:32:01.3022812Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:01.3022905Z op = silu_mul_quant 2025-05-07T20:32:01.3022996Z if compiled: 2025-05-07T20:32:01.3023094Z op = torch.compile(op) 2025-05-07T20:32:01.3023204Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:01.3023276Z 2025-05-07T20:32:01.3023366Z > y_fp8, y_scale = fn() 2025-05-07T20:32:01.3023376Z 2025-05-07T20:32:01.3023473Z moe/activation_test.py:117: 2025-05-07T20:32:01.3023601Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:01.3023698Z moe/activation_test.py:115: in fn 2025-05-07T20:32:01.3023802Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:01.3024162Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/torch/_dynamo/eval_frame.py:678: in _fn 2025-05-07T20:32:01.3024258Z return fn(*args, **kwargs) 
2025-05-07T20:32:01.3024747Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:32:01.3024842Z _fbgemm_silu_mul_quant[grid]( 2025-05-07T20:32:01.3025209Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:01.3025427Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:01.3025766Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:01.3025862Z kernel = self.compile( 2025-05-07T20:32:01.3026241Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:01.3026416Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:01.3026544Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:01.3026549Z 2025-05-07T20:32:01.3026750Z self = 2025-05-07T20:32:01.3027521Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:01.3028017Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f93a4d72d40>} 2025-05-07T20:32:01.3028833Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:01.3029019Z context = 2025-05-07T20:32:01.3029023Z 2025-05-07T20:32:01.3029188Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:01.3029448Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:01.3029627Z module_map=module_map) 2025-05-07T20:32:01.3029793Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:01.3029893Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:32:01.3029969Z E ^ 2025-05-07T20:32:01.3030326Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:01.3030330Z 2025-05-07T20:32:01.3030736Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:01.3030740Z 2025-05-07T20:32:01.3030845Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:01.3031063Z self=, 2025-05-07T20:32:01.3031140Z T=2048, 2025-05-07T20:32:01.3031222Z D=7168, 2025-05-07T20:32:01.3031306Z scale_ub=1200.0, 2025-05-07T20:32:01.3031392Z contiguous=False, 2025-05-07T20:32:01.3031479Z compiled=True, 2025-05-07T20:32:01.3031554Z ) 2025-05-07T20:32:01.3031773Z self = 2025-05-07T20:32:01.3031949Z T = 2048, D = 7168, scale_ub = 1200.0, contiguous = False, compiled = True 2025-05-07T20:32:01.3031957Z 2025-05-07T20:32:01.3032034Z @given( 2025-05-07T20:32:01.3032158Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:01.3032258Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:01.3032375Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:01.3032496Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:01.3032609Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:01.3032686Z ) 2025-05-07T20:32:01.3032933Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:01.3033026Z def test_silu_mul_quant( 2025-05-07T20:32:01.3033103Z self, 2025-05-07T20:32:01.3033185Z T: int, 2025-05-07T20:32:01.3033264Z D: int, 2025-05-07T20:32:01.3033371Z scale_ub: Optional[float], 2025-05-07T20:32:01.3033461Z contiguous: bool, 2025-05-07T20:32:01.3033546Z compiled: bool, 2025-05-07T20:32:01.3033627Z ) -> None: 2025-05-07T20:32:01.3033720Z torch.manual_seed(2025) 2025-05-07T20:32:01.3033797Z 2025-05-07T20:32:01.3033970Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:01.3034043Z 2025-05-07T20:32:01.3034133Z x_sign = torch.sign(x) 2025-05-07T20:32:01.3034260Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:01.3034347Z x = x_sign * x_clamp 2025-05-07T20:32:01.3034427Z x0 = x[:, :D] 2025-05-07T20:32:01.3034510Z x1 = x[:, D:] 2025-05-07T20:32:01.3034580Z 2025-05-07T20:32:01.3034666Z if contiguous: 2025-05-07T20:32:01.3034755Z x0 = x0.contiguous() 2025-05-07T20:32:01.3034842Z x1 = x1.contiguous() 2025-05-07T20:32:01.3034919Z 2025-05-07T20:32:01.3035014Z if scale_ub is not None: 2025-05-07T20:32:01.3035118Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:01.3035258Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:01.3035332Z ) 2025-05-07T20:32:01.3035494Z else: 2025-05-07T20:32:01.3035592Z scale_ub_tensor = None 2025-05-07T20:32:01.3035664Z 2025-05-07T20:32:01.3035792Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:01.3035881Z op = silu_mul_quant 2025-05-07T20:32:01.3035965Z if compiled: 2025-05-07T20:32:01.3036063Z op = torch.compile(op) 2025-05-07T20:32:01.3036170Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:01.3036240Z 2025-05-07T20:32:01.3036335Z > y_fp8, y_scale = fn() 2025-05-07T20:32:01.3036339Z 2025-05-07T20:32:01.3036435Z moe/activation_test.py:117: 2025-05-07T20:32:01.3036565Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:01.3036749Z moe/activation_test.py:115: in fn 2025-05-07T20:32:01.3036849Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:01.3037212Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/torch/_dynamo/eval_frame.py:678: in _fn 2025-05-07T20:32:01.3037315Z return fn(*args, **kwargs) 
2025-05-07T20:32:01.3037801Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:32:01.3037901Z _fbgemm_silu_mul_quant[grid]( 2025-05-07T20:32:01.3038255Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:01.3038474Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:01.3038814Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:01.3038905Z kernel = self.compile( 2025-05-07T20:32:01.3039286Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:01.3039462Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:01.3039595Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:01.3039599Z 2025-05-07T20:32:01.3039804Z self = 2025-05-07T20:32:01.3040562Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:01.3041058Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f93cd947e20>} 2025-05-07T20:32:01.3041794Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:01.3041990Z context = 2025-05-07T20:32:01.3041998Z 2025-05-07T20:32:01.3042168Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:01.3042426Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:01.3042537Z module_map=module_map) 2025-05-07T20:32:01.3042697Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:01.3042795Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:32:01.3042878Z E ^ 2025-05-07T20:32:01.3043222Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:01.3043227Z 2025-05-07T20:32:01.3043635Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:01.3043645Z 2025-05-07T20:32:01.3043747Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:01.3043967Z self=, 2025-05-07T20:32:01.3044131Z T=1, 2025-05-07T20:32:01.3044207Z D=5120, 2025-05-07T20:32:01.3044290Z scale_ub=None, 2025-05-07T20:32:01.3044377Z contiguous=False, 2025-05-07T20:32:01.3044462Z compiled=False, 2025-05-07T20:32:01.3044538Z ) 2025-05-07T20:32:01.3044757Z self = 2025-05-07T20:32:01.3044920Z T = 1, D = 5120, scale_ub = None, contiguous = False, compiled = False 2025-05-07T20:32:01.3044924Z 2025-05-07T20:32:01.3045004Z @given( 2025-05-07T20:32:01.3045122Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:01.3045221Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:01.3045436Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:01.3045573Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:01.3045690Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:01.3045767Z ) 2025-05-07T20:32:01.3046013Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:01.3046103Z def test_silu_mul_quant( 2025-05-07T20:32:01.3046182Z self, 2025-05-07T20:32:01.3046258Z T: int, 2025-05-07T20:32:01.3050345Z D: int, 2025-05-07T20:32:01.3050448Z scale_ub: Optional[float], 2025-05-07T20:32:01.3050546Z contiguous: bool, 2025-05-07T20:32:01.3050631Z compiled: bool, 2025-05-07T20:32:01.3050711Z ) -> None: 2025-05-07T20:32:01.3050810Z torch.manual_seed(2025) 2025-05-07T20:32:01.3050881Z 2025-05-07T20:32:01.3051050Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:01.3051124Z 2025-05-07T20:32:01.3051221Z x_sign = torch.sign(x) 2025-05-07T20:32:01.3051344Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:01.3051435Z x = x_sign * x_clamp 2025-05-07T20:32:01.3051514Z x0 = x[:, :D] 2025-05-07T20:32:01.3051594Z x1 = x[:, D:] 2025-05-07T20:32:01.3051677Z 2025-05-07T20:32:01.3051760Z if contiguous: 2025-05-07T20:32:01.3051855Z x0 = x0.contiguous() 2025-05-07T20:32:01.3051942Z x1 = x1.contiguous() 2025-05-07T20:32:01.3052013Z 2025-05-07T20:32:01.3052109Z if scale_ub is not None: 2025-05-07T20:32:01.3052215Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:01.3052347Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:01.3052425Z ) 2025-05-07T20:32:01.3052504Z else: 2025-05-07T20:32:01.3052597Z scale_ub_tensor = None 2025-05-07T20:32:01.3052671Z 2025-05-07T20:32:01.3052800Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:01.3052897Z op = silu_mul_quant 2025-05-07T20:32:01.3052983Z if compiled: 2025-05-07T20:32:01.3053083Z op = torch.compile(op) 2025-05-07T20:32:01.3053189Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:01.3053265Z 2025-05-07T20:32:01.3053353Z > y_fp8, y_scale = fn() 2025-05-07T20:32:01.3053358Z 2025-05-07T20:32:01.3053457Z moe/activation_test.py:117: 2025-05-07T20:32:01.3053586Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:01.3053751Z moe/activation_test.py:115: in fn 2025-05-07T20:32:01.3053857Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:01.3054352Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:32:01.3054447Z 
_fbgemm_silu_mul_quant[grid]( 2025-05-07T20:32:01.3054804Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:01.3055027Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:01.3055368Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:01.3055573Z kernel = self.compile( 2025-05-07T20:32:01.3055951Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:01.3056126Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:01.3056253Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:01.3056258Z 2025-05-07T20:32:01.3056463Z self = 2025-05-07T20:32:01.3057300Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:01.3057797Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f93a55eb100>} 2025-05-07T20:32:01.3058539Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:01.3058725Z context = 2025-05-07T20:32:01.3058730Z 2025-05-07T20:32:01.3058898Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:01.3059156Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:01.3059265Z module_map=module_map) 2025-05-07T20:32:01.3059429Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:01.3059532Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:32:01.3059613Z E ^ 2025-05-07T20:32:01.3059963Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:01.3059973Z 2025-05-07T20:32:01.3060379Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:01.3060384Z 2025-05-07T20:32:01.3060490Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:01.3060708Z self=, 2025-05-07T20:32:01.3060793Z T=4096, 2025-05-07T20:32:01.3060870Z D=7168, 2025-05-07T20:32:01.3060954Z scale_ub=1200.0, 2025-05-07T20:32:01.3061046Z contiguous=False, 2025-05-07T20:32:01.3061129Z compiled=False, 2025-05-07T20:32:01.3061203Z ) 2025-05-07T20:32:01.3061425Z self = 2025-05-07T20:32:01.3061603Z T = 4096, D = 7168, scale_ub = 1200.0, contiguous = False, compiled = False 2025-05-07T20:32:01.3061607Z 2025-05-07T20:32:01.3061683Z @given( 2025-05-07T20:32:01.3061804Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:01.3061909Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:01.3062023Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:01.3062144Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:01.3062257Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:01.3062337Z ) 2025-05-07T20:32:01.3062581Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:01.3062676Z def test_silu_mul_quant( 2025-05-07T20:32:01.3062759Z self, 2025-05-07T20:32:01.3062835Z T: int, 2025-05-07T20:32:01.3062913Z D: int, 2025-05-07T20:32:01.3063020Z scale_ub: Optional[float], 2025-05-07T20:32:01.3063109Z contiguous: bool, 2025-05-07T20:32:01.3063202Z compiled: bool, 2025-05-07T20:32:01.3063284Z ) -> None: 2025-05-07T20:32:01.3063383Z torch.manual_seed(2025) 2025-05-07T20:32:01.3063457Z 2025-05-07T20:32:01.3063624Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:01.3063787Z 2025-05-07T20:32:01.3063879Z x_sign = torch.sign(x) 2025-05-07T20:32:01.3064000Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:01.3064092Z x = x_sign * x_clamp 2025-05-07T20:32:01.3064173Z x0 = x[:, :D] 2025-05-07T20:32:01.3064252Z x1 = x[:, D:] 2025-05-07T20:32:01.3064326Z 2025-05-07T20:32:01.3064408Z if contiguous: 2025-05-07T20:32:01.3064506Z x0 = x0.contiguous() 2025-05-07T20:32:01.3064594Z x1 = x1.contiguous() 2025-05-07T20:32:01.3064668Z 2025-05-07T20:32:01.3064758Z if scale_ub is not None: 2025-05-07T20:32:01.3064862Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:01.3065070Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:01.3065152Z ) 2025-05-07T20:32:01.3065231Z else: 2025-05-07T20:32:01.3065326Z scale_ub_tensor = None 2025-05-07T20:32:01.3065401Z 2025-05-07T20:32:01.3065534Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:01.3065621Z op = silu_mul_quant 2025-05-07T20:32:01.3065709Z if compiled: 2025-05-07T20:32:01.3065806Z op = torch.compile(op) 2025-05-07T20:32:01.3065913Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:01.3065983Z 2025-05-07T20:32:01.3066073Z > y_fp8, y_scale = fn() 2025-05-07T20:32:01.3066077Z 2025-05-07T20:32:01.3066178Z moe/activation_test.py:117: 2025-05-07T20:32:01.3066305Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:01.3066403Z moe/activation_test.py:115: in fn 2025-05-07T20:32:01.3066503Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:01.3066995Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 
2025-05-07T20:32:01.3067094Z _fbgemm_silu_mul_quant[grid]( 2025-05-07T20:32:01.3067457Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:01.3067675Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:01.3068012Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:01.3068105Z kernel = self.compile( 2025-05-07T20:32:01.3068481Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:01.3068656Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:01.3068779Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:01.3068791Z 2025-05-07T20:32:01.3068995Z self = 2025-05-07T20:32:01.3069755Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:01.3070254Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f9397cdc2c0>} 2025-05-07T20:32:01.3070991Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:01.3071177Z context = 2025-05-07T20:32:01.3071182Z 2025-05-07T20:32:01.3071355Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:01.3071610Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:01.3071716Z module_map=module_map) 2025-05-07T20:32:01.3071963Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:01.3072063Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:32:01.3072144Z E ^ 2025-05-07T20:32:01.3072492Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:01.3072497Z 2025-05-07T20:32:01.3072902Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:01.3072906Z 2025-05-07T20:32:01.3073012Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:01.3073231Z self=, 2025-05-07T20:32:01.3073311Z T=16384, 2025-05-07T20:32:01.3073465Z D=7168, 2025-05-07T20:32:01.3073551Z scale_ub=None, 2025-05-07T20:32:01.3073639Z contiguous=True, 2025-05-07T20:32:01.3073723Z compiled=True, 2025-05-07T20:32:01.3073797Z ) 2025-05-07T20:32:01.3074021Z self = 2025-05-07T20:32:01.3074192Z T = 16384, D = 7168, scale_ub = None, contiguous = True, compiled = True 2025-05-07T20:32:01.3074196Z 2025-05-07T20:32:01.3074272Z @given( 2025-05-07T20:32:01.3074393Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:01.3074493Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:01.3074611Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:01.3074727Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:01.3074841Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:01.3074919Z ) 2025-05-07T20:32:01.3075164Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:01.3075258Z def test_silu_mul_quant( 2025-05-07T20:32:01.3075338Z self, 2025-05-07T20:32:01.3075417Z T: int, 2025-05-07T20:32:01.3075495Z D: int, 2025-05-07T20:32:01.3075595Z scale_ub: Optional[float], 2025-05-07T20:32:01.3075690Z contiguous: bool, 2025-05-07T20:32:01.3075774Z compiled: bool, 2025-05-07T20:32:01.3075855Z ) -> None: 2025-05-07T20:32:01.3075949Z torch.manual_seed(2025) 2025-05-07T20:32:01.3076024Z 2025-05-07T20:32:01.3076188Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:01.3076262Z 2025-05-07T20:32:01.3076356Z x_sign = torch.sign(x) 2025-05-07T20:32:01.3076483Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:01.3076571Z x = x_sign * x_clamp 2025-05-07T20:32:01.3076658Z x0 = x[:, :D] 2025-05-07T20:32:01.3076739Z x1 = x[:, D:] 2025-05-07T20:32:01.3076811Z 2025-05-07T20:32:01.3076903Z if contiguous: 2025-05-07T20:32:01.3076994Z x0 = x0.contiguous() 2025-05-07T20:32:01.3077084Z x1 = x1.contiguous() 2025-05-07T20:32:01.3077160Z 2025-05-07T20:32:01.3077251Z if scale_ub is not None: 2025-05-07T20:32:01.3077360Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:01.3077497Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:01.3077572Z ) 2025-05-07T20:32:01.3077651Z else: 2025-05-07T20:32:01.3077743Z scale_ub_tensor = None 2025-05-07T20:32:01.3077814Z 2025-05-07T20:32:01.3077946Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:01.3078035Z op = silu_mul_quant 2025-05-07T20:32:01.3078119Z if compiled: 2025-05-07T20:32:01.3078220Z op = torch.compile(op) 2025-05-07T20:32:01.3078323Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:01.3078394Z 2025-05-07T20:32:01.3078486Z > y_fp8, y_scale = fn() 2025-05-07T20:32:01.3078495Z 2025-05-07T20:32:01.3078590Z moe/activation_test.py:117: 2025-05-07T20:32:01.3078720Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:01.3078819Z moe/activation_test.py:115: in fn 2025-05-07T20:32:01.3079029Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:01.3079397Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/torch/_dynamo/eval_frame.py:678: in _fn 2025-05-07T20:32:01.3079488Z return fn(*args, **kwargs) 
2025-05-07T20:32:01.3079970Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:32:01.3080069Z _fbgemm_silu_mul_quant[grid]( 2025-05-07T20:32:01.3080423Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:01.3080644Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:01.3081051Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:01.3081145Z kernel = self.compile( 2025-05-07T20:32:01.3081526Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:01.3081703Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:01.3081828Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:01.3081836Z 2025-05-07T20:32:01.3082037Z self = 2025-05-07T20:32:01.3082798Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:01.3083301Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f9397cddc60>} 2025-05-07T20:32:01.3084033Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:01.3084228Z context = 2025-05-07T20:32:01.3084232Z 2025-05-07T20:32:01.3084393Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:01.3084647Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:01.3084755Z module_map=module_map) 2025-05-07T20:32:01.3084914Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:01.3085012Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:32:01.3085095Z E ^ 2025-05-07T20:32:01.3085445Z E ValueError("type fp8e4nv not supported in this architecture. 
2025-05-07T20:32:01.3085964Z Trying example: test_silu_mul_quant(
    self=<repr lost in log capture>,
    T=4096,
    D=5120,
    scale_ub=None,
    contiguous=False,
    compiled=True,
)
self = <repr lost in log capture>
T = 4096, D = 5120, scale_ub = None, contiguous = False, compiled = True

    @given(
        T=st.sampled_from([1, 128, 2048, 4096, 16384]),
        D=st.sampled_from([5120, 7168]),
        scale_ub=st.sampled_from([None, 1200.00]),
        contiguous=st.sampled_from([True, False]),
        compiled=st.sampled_from([True, False]),
    )
    @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None)
    def test_silu_mul_quant(
        self,
        T: int,
        D: int,
        scale_ub: Optional[float],
        contiguous: bool,
        compiled: bool,
    ) -> None:
        torch.manual_seed(2025)

        x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16)

        x_sign = torch.sign(x)
        x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0)
        x = x_sign * x_clamp
        x0 = x[:, :D]
        x1 = x[:, D:]

        if contiguous:
            x0 = x0.contiguous()
            x1 = x1.contiguous()

        if scale_ub is not None:
            scale_ub_tensor = torch.tensor(
                [scale_ub], device="cuda", dtype=torch.float32
            )
        else:
            scale_ub_tensor = None

        def fn() -> Tuple[torch.Tensor, torch.Tensor]:
            op = silu_mul_quant
            if compiled:
                op = torch.compile(op)
            return op(x0, x1, scale_ub_tensor)

>       y_fp8, y_scale = fn()

moe/activation_test.py:117:
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _
moe/activation_test.py:115: in fn
    return op(x0, x1, scale_ub_tensor)
/home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/torch/_dynamo/eval_frame.py:678: in _fn
    return fn(*args, **kwargs)
/home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant
    _fbgemm_silu_mul_quant[grid](
/home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/jit.py:330: in <lambda>
    return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs)
/home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/jit.py:623: in run
    kernel = self.compile(
/home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:273: in compile
    module = src.make_ir(options, codegen_fns, module_map, context)
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _

self = <repr lost in log capture>
options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True)
codegen_fns = {'convert_custom_types': <function; repr lost in log capture>, 'min_dot_size': <function; repr lost in log capture>}
module_map = {'triton.language.extra.libdevice': <module; repr lost in log capture>}
context = <repr lost in log capture>

    def make_ir(self, options, codegen_fns, module_map, context):
>       return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns,
                           module_map=module_map)
E       triton.compiler.errors.CompilationError: at 1:0:
E       def _fbgemm_silu_mul_quant(
E       ^
E       ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')")

/home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:100: CompilationError
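For readers without the source handy: judging only from the call site above (two bfloat16 halves x0 and x1, an optional float32 scale_ub tensor, and a (y_fp8, y_scale) return), silu_mul_quant fuses a SwiGLU-style gated activation, silu(x0) * x1, with FP8 quantization. The eager-mode sketch below captures that contract; it assumes row-wise E4M3 scaling and is inferred from the test, not taken from the FBGEMM kernel:

    from typing import Optional, Tuple

    import torch

    FP8_MAX = torch.finfo(torch.float8_e4m3fn).max  # 448.0 for E4M3


    def silu_mul_quant_ref(
        x0: torch.Tensor,
        x1: torch.Tensor,
        scale_ub: Optional[torch.Tensor] = None,
    ) -> Tuple[torch.Tensor, torch.Tensor]:
        # Gated activation computed in fp32 for accuracy: silu(x0) * x1.
        y = torch.nn.functional.silu(x0.float()) * x1.float()
        # Per-row absmax, optionally capped by the caller-supplied upper bound.
        row_max = y.abs().amax(dim=-1, keepdim=True).clamp(min=1e-12)
        if scale_ub is not None:
            row_max = torch.minimum(row_max, scale_ub)
        y_scale = row_max / FP8_MAX
        y_fp8 = (y / y_scale).to(torch.float8_e4m3fn)
        return y_fp8, y_scale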
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:01.3098834Z 2025-05-07T20:32:01.3099238Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:01.3099242Z 2025-05-07T20:32:01.3099344Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:01.3099569Z self=, 2025-05-07T20:32:01.3099643Z T=4096, 2025-05-07T20:32:01.3099716Z D=5120, 2025-05-07T20:32:01.3099800Z scale_ub=1200.0, 2025-05-07T20:32:01.3099884Z contiguous=False, 2025-05-07T20:32:01.3099966Z compiled=False, 2025-05-07T20:32:01.3100043Z ) 2025-05-07T20:32:01.3100253Z self = 2025-05-07T20:32:01.3100427Z T = 4096, D = 5120, scale_ub = 1200.0, contiguous = False, compiled = False 2025-05-07T20:32:01.3100431Z 2025-05-07T20:32:01.3100504Z @given( 2025-05-07T20:32:01.3100620Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:01.3100719Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:01.3100829Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:01.3100942Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:01.3101055Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:01.3101126Z ) 2025-05-07T20:32:01.3101372Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:01.3101463Z def test_silu_mul_quant( 2025-05-07T20:32:01.3101537Z self, 2025-05-07T20:32:01.3101615Z T: int, 2025-05-07T20:32:01.3101693Z D: int, 2025-05-07T20:32:01.3101787Z scale_ub: Optional[float], 2025-05-07T20:32:01.3101878Z contiguous: bool, 2025-05-07T20:32:01.3101960Z compiled: bool, 2025-05-07T20:32:01.3102036Z ) -> None: 2025-05-07T20:32:01.3102133Z torch.manual_seed(2025) 2025-05-07T20:32:01.3102202Z 2025-05-07T20:32:01.3102363Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:01.3102440Z 2025-05-07T20:32:01.3102528Z x_sign = torch.sign(x) 2025-05-07T20:32:01.3102654Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:01.3102739Z x = x_sign * x_clamp 2025-05-07T20:32:01.3102817Z x0 = x[:, :D] 2025-05-07T20:32:01.3102896Z x1 = x[:, D:] 2025-05-07T20:32:01.3102971Z 2025-05-07T20:32:01.3103052Z if contiguous: 2025-05-07T20:32:01.3103145Z x0 = x0.contiguous() 2025-05-07T20:32:01.3103231Z x1 = x1.contiguous() 2025-05-07T20:32:01.3103449Z 2025-05-07T20:32:01.3103540Z if scale_ub is not None: 2025-05-07T20:32:01.3103642Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:01.3103774Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:01.3103851Z ) 2025-05-07T20:32:01.3103925Z else: 2025-05-07T20:32:01.3104016Z scale_ub_tensor = None 2025-05-07T20:32:01.3104092Z 2025-05-07T20:32:01.3104215Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:01.3104307Z op = silu_mul_quant 2025-05-07T20:32:01.3104389Z if compiled: 2025-05-07T20:32:01.3104485Z op = torch.compile(op) 2025-05-07T20:32:01.3104589Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:01.3104768Z 2025-05-07T20:32:01.3104858Z > y_fp8, y_scale = fn() 2025-05-07T20:32:01.3104862Z 2025-05-07T20:32:01.3104959Z moe/activation_test.py:117: 2025-05-07T20:32:01.3105087Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:01.3105191Z moe/activation_test.py:115: in fn 2025-05-07T20:32:01.3105292Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:01.3105832Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 
2025-05-07T20:32:01.3105934Z _fbgemm_silu_mul_quant[grid]( 2025-05-07T20:32:01.3106288Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:01.3106509Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:01.3106848Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:01.3106941Z kernel = self.compile( 2025-05-07T20:32:01.3107322Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:01.3107490Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:01.3107624Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:01.3107629Z 2025-05-07T20:32:01.3107827Z self = 2025-05-07T20:32:01.3108589Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:01.3109078Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f9397cdfba0>} 2025-05-07T20:32:01.3109817Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:01.3110006Z context = 2025-05-07T20:32:01.3110010Z 2025-05-07T20:32:01.3110174Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:01.3110435Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:01.3110540Z module_map=module_map) 2025-05-07T20:32:01.3110696Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:01.3110797Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:32:01.3110872Z E ^ 2025-05-07T20:32:01.3111223Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:01.3111232Z 2025-05-07T20:32:01.3111636Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:01.3111640Z 2025-05-07T20:32:01.3111740Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:01.3112075Z self=, 2025-05-07T20:32:01.3112151Z T=4096, 2025-05-07T20:32:01.3112235Z D=5120, 2025-05-07T20:32:01.3112317Z scale_ub=1200.0, 2025-05-07T20:32:01.3112403Z contiguous=False, 2025-05-07T20:32:01.3112490Z compiled=True, 2025-05-07T20:32:01.3112562Z ) 2025-05-07T20:32:01.3112773Z self = 2025-05-07T20:32:01.3112948Z T = 4096, D = 5120, scale_ub = 1200.0, contiguous = False, compiled = True 2025-05-07T20:32:01.3112952Z 2025-05-07T20:32:01.3113026Z @given( 2025-05-07T20:32:01.3113141Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:01.3113315Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:01.3113428Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:01.3113544Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:01.3113654Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:01.3113731Z ) 2025-05-07T20:32:01.3113973Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:01.3114063Z def test_silu_mul_quant( 2025-05-07T20:32:01.3114138Z self, 2025-05-07T20:32:01.3114217Z T: int, 2025-05-07T20:32:01.3114291Z D: int, 2025-05-07T20:32:01.3114386Z scale_ub: Optional[float], 2025-05-07T20:32:01.3114477Z contiguous: bool, 2025-05-07T20:32:01.3114560Z compiled: bool, 2025-05-07T20:32:01.3114636Z ) -> None: 2025-05-07T20:32:01.3114730Z torch.manual_seed(2025) 2025-05-07T20:32:01.3114800Z 2025-05-07T20:32:01.3114972Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:01.3115043Z 2025-05-07T20:32:01.3115134Z x_sign = torch.sign(x) 2025-05-07T20:32:01.3115257Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:01.3115343Z x = x_sign * x_clamp 2025-05-07T20:32:01.3115425Z x0 = x[:, :D] 2025-05-07T20:32:01.3115508Z x1 = x[:, D:] 2025-05-07T20:32:01.3115576Z 2025-05-07T20:32:01.3115657Z if contiguous: 2025-05-07T20:32:01.3115751Z x0 = x0.contiguous() 2025-05-07T20:32:01.3115836Z x1 = x1.contiguous() 2025-05-07T20:32:01.3115906Z 2025-05-07T20:32:01.3115997Z if scale_ub is not None: 2025-05-07T20:32:01.3116099Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:01.3116229Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:01.3116307Z ) 2025-05-07T20:32:01.3116380Z else: 2025-05-07T20:32:01.3116476Z scale_ub_tensor = None 2025-05-07T20:32:01.3116547Z 2025-05-07T20:32:01.3116678Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:01.3116771Z op = silu_mul_quant 2025-05-07T20:32:01.3116852Z if compiled: 2025-05-07T20:32:01.3116948Z op = torch.compile(op) 2025-05-07T20:32:01.3117056Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:01.3117129Z 2025-05-07T20:32:01.3117217Z > y_fp8, y_scale = fn() 2025-05-07T20:32:01.3117221Z 2025-05-07T20:32:01.3117317Z moe/activation_test.py:117: 2025-05-07T20:32:01.3117443Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:01.3117543Z moe/activation_test.py:115: in fn 2025-05-07T20:32:01.3117639Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:01.3117998Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/torch/_dynamo/eval_frame.py:678: in _fn 2025-05-07T20:32:01.3118093Z return fn(*args, **kwargs) 
2025-05-07T20:32:01.3118580Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:32:01.3118673Z _fbgemm_silu_mul_quant[grid]( 2025-05-07T20:32:01.3119026Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:01.3119329Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:01.3119665Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:01.3119755Z kernel = self.compile( 2025-05-07T20:32:01.3120132Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:01.3120305Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:01.3120428Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:01.3120433Z 2025-05-07T20:32:01.3120706Z self = 2025-05-07T20:32:01.3121465Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:01.3121967Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f9397d94ea0>} 2025-05-07T20:32:01.3122698Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:01.3122881Z context = 2025-05-07T20:32:01.3122885Z 2025-05-07T20:32:01.3123048Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:01.3123307Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:01.3123411Z module_map=module_map) 2025-05-07T20:32:01.3123573Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:01.3123674Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:32:01.3123748Z E ^ 2025-05-07T20:32:01.3124095Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:01.3124100Z 2025-05-07T20:32:01.3124506Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:01.3124510Z 2025-05-07T20:32:01.3124617Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:01.3124835Z self=, 2025-05-07T20:32:01.3124907Z T=2048, 2025-05-07T20:32:01.3124984Z D=7168, 2025-05-07T20:32:01.3125068Z scale_ub=1200.0, 2025-05-07T20:32:01.3125154Z contiguous=False, 2025-05-07T20:32:01.3125236Z compiled=False, 2025-05-07T20:32:01.3125305Z ) 2025-05-07T20:32:01.3125555Z self = 2025-05-07T20:32:01.3125744Z T = 2048, D = 7168, scale_ub = 1200.0, contiguous = False, compiled = False 2025-05-07T20:32:01.3125748Z 2025-05-07T20:32:01.3125824Z @given( 2025-05-07T20:32:01.3125942Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:01.3126039Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:01.3126147Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:01.3126263Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:01.3126374Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:01.3126449Z ) 2025-05-07T20:32:01.3126689Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:01.3126783Z def test_silu_mul_quant( 2025-05-07T20:32:01.3126859Z self, 2025-05-07T20:32:01.3126932Z T: int, 2025-05-07T20:32:01.3127005Z D: int, 2025-05-07T20:32:01.3127101Z scale_ub: Optional[float], 2025-05-07T20:32:01.3127275Z contiguous: bool, 2025-05-07T20:32:01.3127359Z compiled: bool, 2025-05-07T20:32:01.3127438Z ) -> None: 2025-05-07T20:32:01.3127529Z torch.manual_seed(2025) 2025-05-07T20:32:01.3127598Z 2025-05-07T20:32:01.3127763Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:01.3127832Z 2025-05-07T20:32:01.3127923Z x_sign = torch.sign(x) 2025-05-07T20:32:01.3128043Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:01.3128129Z x = x_sign * x_clamp 2025-05-07T20:32:01.3128209Z x0 = x[:, :D] 2025-05-07T20:32:01.3128285Z x1 = x[:, D:] 2025-05-07T20:32:01.3128354Z 2025-05-07T20:32:01.3128438Z if contiguous: 2025-05-07T20:32:01.3128602Z x0 = x0.contiguous() 2025-05-07T20:32:01.3128690Z x1 = x1.contiguous() 2025-05-07T20:32:01.3128761Z 2025-05-07T20:32:01.3128849Z if scale_ub is not None: 2025-05-07T20:32:01.3128949Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:01.3129088Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:01.3129160Z ) 2025-05-07T20:32:01.3129233Z else: 2025-05-07T20:32:01.3129329Z scale_ub_tensor = None 2025-05-07T20:32:01.3129399Z 2025-05-07T20:32:01.3129525Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:01.3129612Z op = silu_mul_quant 2025-05-07T20:32:01.3129693Z if compiled: 2025-05-07T20:32:01.3129793Z op = torch.compile(op) 2025-05-07T20:32:01.3129894Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:01.3129964Z 2025-05-07T20:32:01.3130055Z > y_fp8, y_scale = fn() 2025-05-07T20:32:01.3130060Z 2025-05-07T20:32:01.3130157Z moe/activation_test.py:117: 2025-05-07T20:32:01.3130283Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:01.3130384Z moe/activation_test.py:115: in fn 2025-05-07T20:32:01.3130478Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:01.3130974Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 
2025-05-07T20:32:01.3131068Z _fbgemm_silu_mul_quant[grid]( 2025-05-07T20:32:01.3131420Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:01.3131643Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:01.3131975Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:01.3132064Z kernel = self.compile( 2025-05-07T20:32:01.3132445Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:01.3132613Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:01.3132739Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:01.3132747Z 2025-05-07T20:32:01.3132943Z self = 2025-05-07T20:32:01.3133776Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:01.3134270Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f9397d95940>} 2025-05-07T20:32:01.3135002Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:01.3135188Z context = 2025-05-07T20:32:01.3135192Z 2025-05-07T20:32:01.3135439Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:01.3135697Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:01.3135803Z module_map=module_map) 2025-05-07T20:32:01.3135958Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:01.3136057Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:32:01.3136132Z E ^ 2025-05-07T20:32:01.3136476Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:01.3136480Z 2025-05-07T20:32:01.3136983Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:01.3136988Z 2025-05-07T20:32:01.3137090Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:01.3137310Z self=, 2025-05-07T20:32:01.3137390Z T=1, 2025-05-07T20:32:01.3137464Z D=7168, 2025-05-07T20:32:01.3137548Z scale_ub=None, 2025-05-07T20:32:01.3137630Z contiguous=True, 2025-05-07T20:32:01.3137709Z compiled=False, 2025-05-07T20:32:01.3137781Z ) 2025-05-07T20:32:01.3137994Z self = 2025-05-07T20:32:01.3138150Z T = 1, D = 7168, scale_ub = None, contiguous = True, compiled = False 2025-05-07T20:32:01.3138159Z 2025-05-07T20:32:01.3138232Z @given( 2025-05-07T20:32:01.3138349Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:01.3138451Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:01.3138562Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:01.3138681Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:01.3138793Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:01.3138865Z ) 2025-05-07T20:32:01.3139104Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:01.3139201Z def test_silu_mul_quant( 2025-05-07T20:32:01.3139274Z self, 2025-05-07T20:32:01.3139350Z T: int, 2025-05-07T20:32:01.3139428Z D: int, 2025-05-07T20:32:01.3139524Z scale_ub: Optional[float], 2025-05-07T20:32:01.3139612Z contiguous: bool, 2025-05-07T20:32:01.3139693Z compiled: bool, 2025-05-07T20:32:01.3139767Z ) -> None: 2025-05-07T20:32:01.3139860Z torch.manual_seed(2025) 2025-05-07T20:32:01.3139931Z 2025-05-07T20:32:01.3140093Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:01.3140169Z 2025-05-07T20:32:01.3140258Z x_sign = torch.sign(x) 2025-05-07T20:32:01.3140385Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:01.3140473Z x = x_sign * x_clamp 2025-05-07T20:32:01.3140551Z x0 = x[:, :D] 2025-05-07T20:32:01.3140627Z x1 = x[:, D:] 2025-05-07T20:32:01.3140706Z 2025-05-07T20:32:01.3140785Z if contiguous: 2025-05-07T20:32:01.3140878Z x0 = x0.contiguous() 2025-05-07T20:32:01.3140963Z x1 = x1.contiguous() 2025-05-07T20:32:01.3141032Z 2025-05-07T20:32:01.3141122Z if scale_ub is not None: 2025-05-07T20:32:01.3141224Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:01.3141354Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:01.3141429Z ) 2025-05-07T20:32:01.3141502Z else: 2025-05-07T20:32:01.3141593Z scale_ub_tensor = None 2025-05-07T20:32:01.3141667Z 2025-05-07T20:32:01.3141792Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:01.3141878Z op = silu_mul_quant 2025-05-07T20:32:01.3141965Z if compiled: 2025-05-07T20:32:01.3142061Z op = torch.compile(op) 2025-05-07T20:32:01.3142165Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:01.3142236Z 2025-05-07T20:32:01.3142411Z > y_fp8, y_scale = fn() 2025-05-07T20:32:01.3142415Z 2025-05-07T20:32:01.3142512Z moe/activation_test.py:117: 2025-05-07T20:32:01.3142637Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:01.3142735Z moe/activation_test.py:115: in fn 2025-05-07T20:32:01.3142832Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:01.3143319Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:32:01.3143412Z 
_fbgemm_silu_mul_quant[grid]( 2025-05-07T20:32:01.3143769Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:01.3144061Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:01.3144399Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:01.3144494Z kernel = self.compile( 2025-05-07T20:32:01.3144869Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:01.3145039Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:01.3145164Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:01.3145168Z 2025-05-07T20:32:01.3145369Z self = 2025-05-07T20:32:01.3146131Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:01.3146628Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f9397d96ca0>} 2025-05-07T20:32:01.3147358Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:01.3147548Z context = 2025-05-07T20:32:01.3147552Z 2025-05-07T20:32:01.3147716Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:01.3149266Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:01.3149371Z module_map=module_map) 2025-05-07T20:32:01.3149533Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:01.3149633Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:32:01.3149712Z E ^ 2025-05-07T20:32:01.3150054Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:01.3150059Z 2025-05-07T20:32:01.3150465Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:01.3150470Z 2025-05-07T20:32:01.3150573Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:01.3150788Z self=, 2025-05-07T20:32:01.3150866Z T=16384, 2025-05-07T20:32:01.3150941Z D=7168, 2025-05-07T20:32:01.3151022Z scale_ub=1200.0, 2025-05-07T20:32:01.3151111Z contiguous=False, 2025-05-07T20:32:01.3151193Z compiled=True, 2025-05-07T20:32:01.3151264Z ) 2025-05-07T20:32:01.3151480Z self = 2025-05-07T20:32:01.3151656Z T = 16384, D = 7168, scale_ub = 1200.0, contiguous = False, compiled = True 2025-05-07T20:32:01.3151661Z 2025-05-07T20:32:01.3151735Z @given( 2025-05-07T20:32:01.3151857Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:01.3151953Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:01.3152153Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:01.3152267Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:01.3152375Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:01.3152449Z ) 2025-05-07T20:32:01.3152689Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:01.3152777Z def test_silu_mul_quant( 2025-05-07T20:32:01.3152853Z self, 2025-05-07T20:32:01.3152928Z T: int, 2025-05-07T20:32:01.3153001Z D: int, 2025-05-07T20:32:01.3153099Z scale_ub: Optional[float], 2025-05-07T20:32:01.3153184Z contiguous: bool, 2025-05-07T20:32:01.3153267Z compiled: bool, 2025-05-07T20:32:01.3153420Z ) -> None: 2025-05-07T20:32:01.3153512Z torch.manual_seed(2025) 2025-05-07T20:32:01.3153582Z 2025-05-07T20:32:01.3153749Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:01.3153828Z 2025-05-07T20:32:01.3153921Z x_sign = torch.sign(x) 2025-05-07T20:32:01.3154041Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:01.3154125Z x = x_sign * x_clamp 2025-05-07T20:32:01.3154206Z x0 = x[:, :D] 2025-05-07T20:32:01.3154283Z x1 = x[:, D:] 2025-05-07T20:32:01.3154354Z 2025-05-07T20:32:01.3154435Z if contiguous: 2025-05-07T20:32:01.3154521Z x0 = x0.contiguous() 2025-05-07T20:32:01.3154606Z x1 = x1.contiguous() 2025-05-07T20:32:01.3154678Z 2025-05-07T20:32:01.3154765Z if scale_ub is not None: 2025-05-07T20:32:01.3154865Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:01.3155005Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:01.3155078Z ) 2025-05-07T20:32:01.3155156Z else: 2025-05-07T20:32:01.3155249Z scale_ub_tensor = None 2025-05-07T20:32:01.3155319Z 2025-05-07T20:32:01.3155465Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:01.3155565Z op = silu_mul_quant 2025-05-07T20:32:01.3155665Z if compiled: 2025-05-07T20:32:01.3155770Z op = torch.compile(op) 2025-05-07T20:32:01.3155872Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:01.3155942Z 2025-05-07T20:32:01.3156035Z > y_fp8, y_scale = fn() 2025-05-07T20:32:01.3156039Z 2025-05-07T20:32:01.3156133Z moe/activation_test.py:117: 2025-05-07T20:32:01.3156262Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:01.3156359Z moe/activation_test.py:115: in fn 2025-05-07T20:32:01.3156454Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:01.3156820Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/torch/_dynamo/eval_frame.py:678: in _fn 2025-05-07T20:32:01.3156909Z return fn(*args, **kwargs) 
2025-05-07T20:32:01.3157391Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:32:01.3157492Z _fbgemm_silu_mul_quant[grid]( 2025-05-07T20:32:01.3157842Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:01.3158060Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:01.3158391Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:01.3158484Z kernel = self.compile( 2025-05-07T20:32:01.3158862Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:01.3159035Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:01.3159159Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:01.3159167Z 2025-05-07T20:32:01.3159364Z self = 2025-05-07T20:32:01.3160211Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:01.3160705Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f9397d97f60>} 2025-05-07T20:32:01.3161432Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:01.3161692Z context = 2025-05-07T20:32:01.3161697Z 2025-05-07T20:32:01.3161861Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:01.3162114Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:01.3162226Z module_map=module_map) 2025-05-07T20:32:01.3162382Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:01.3162477Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:32:01.3162555Z E ^ 2025-05-07T20:32:01.3162900Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:01.3162905Z 2025-05-07T20:32:01.3163308Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:01.3163312Z 2025-05-07T20:32:01.3163411Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:01.3163630Z self=, 2025-05-07T20:32:01.3163709Z T=1, 2025-05-07T20:32:01.3163782Z D=7168, 2025-05-07T20:32:01.3163867Z scale_ub=None, 2025-05-07T20:32:01.3163957Z contiguous=False, 2025-05-07T20:32:01.3164038Z compiled=False, 2025-05-07T20:32:01.3164110Z ) 2025-05-07T20:32:01.3164321Z self = 2025-05-07T20:32:01.3164481Z T = 1, D = 7168, scale_ub = None, contiguous = False, compiled = False 2025-05-07T20:32:01.3164485Z 2025-05-07T20:32:01.3164560Z @given( 2025-05-07T20:32:01.3164678Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:01.3164773Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:01.3164887Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:01.3165000Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:01.3165117Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:01.3165188Z ) 2025-05-07T20:32:01.3165450Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:01.3165558Z def test_silu_mul_quant( 2025-05-07T20:32:01.3165645Z self, 2025-05-07T20:32:01.3165719Z T: int, 2025-05-07T20:32:01.3165795Z D: int, 2025-05-07T20:32:01.3165890Z scale_ub: Optional[float], 2025-05-07T20:32:01.3165977Z contiguous: bool, 2025-05-07T20:32:01.3166065Z compiled: bool, 2025-05-07T20:32:01.3166140Z ) -> None: 2025-05-07T20:32:01.3166232Z torch.manual_seed(2025) 2025-05-07T20:32:01.3166306Z 2025-05-07T20:32:01.3166468Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:01.3166539Z 2025-05-07T20:32:01.3166628Z x_sign = torch.sign(x) 2025-05-07T20:32:01.3166748Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:01.3166840Z x = x_sign * x_clamp 2025-05-07T20:32:01.3170270Z x0 = x[:, :D] 2025-05-07T20:32:01.3170363Z x1 = x[:, D:] 2025-05-07T20:32:01.3170436Z 2025-05-07T20:32:01.3170518Z if contiguous: 2025-05-07T20:32:01.3170607Z x0 = x0.contiguous() 2025-05-07T20:32:01.3170804Z x1 = x1.contiguous() 2025-05-07T20:32:01.3170876Z 2025-05-07T20:32:01.3170965Z if scale_ub is not None: 2025-05-07T20:32:01.3171071Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:01.3171204Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:01.3171278Z ) 2025-05-07T20:32:01.3171356Z else: 2025-05-07T20:32:01.3171447Z scale_ub_tensor = None 2025-05-07T20:32:01.3171517Z 2025-05-07T20:32:01.3171648Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:01.3171736Z op = silu_mul_quant 2025-05-07T20:32:01.3171821Z if compiled: 2025-05-07T20:32:01.3171918Z op = torch.compile(op) 2025-05-07T20:32:01.3172119Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:01.3172196Z 2025-05-07T20:32:01.3172285Z > y_fp8, y_scale = fn() 2025-05-07T20:32:01.3172290Z 2025-05-07T20:32:01.3172383Z moe/activation_test.py:117: 2025-05-07T20:32:01.3172519Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:01.3172616Z moe/activation_test.py:115: in fn 2025-05-07T20:32:01.3172713Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:01.3173209Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:32:01.3173303Z 
_fbgemm_silu_mul_quant[grid]( 2025-05-07T20:32:01.3173775Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:01.3173994Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:01.3174332Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:01.3174424Z kernel = self.compile( 2025-05-07T20:32:01.3174803Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:01.3174980Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:01.3175104Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:01.3175108Z 2025-05-07T20:32:01.3175307Z self = 2025-05-07T20:32:01.3176070Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:01.3176567Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f9397a149a0>} 2025-05-07T20:32:01.3177301Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:01.3177493Z context = 2025-05-07T20:32:01.3177498Z 2025-05-07T20:32:01.3177658Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:01.3177917Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:01.3178022Z module_map=module_map) 2025-05-07T20:32:01.3178183Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:01.3178279Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:32:01.3178355Z E ^ 2025-05-07T20:32:01.3178709Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:01.3178714Z 2025-05-07T20:32:01.3179115Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:01.3179207Z 2025-05-07T20:32:01.3179311Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:01.3179527Z self=, 2025-05-07T20:32:01.3179603Z T=2048, 2025-05-07T20:32:01.3179681Z D=7168, 2025-05-07T20:32:01.3179763Z scale_ub=None, 2025-05-07T20:32:01.3179849Z contiguous=False, 2025-05-07T20:32:01.3179933Z compiled=True, 2025-05-07T20:32:01.3180005Z ) 2025-05-07T20:32:01.3180217Z self = 2025-05-07T20:32:01.3180387Z T = 2048, D = 7168, scale_ub = None, contiguous = False, compiled = True 2025-05-07T20:32:01.3180391Z 2025-05-07T20:32:01.3180466Z @given( 2025-05-07T20:32:01.3180661Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:01.3180765Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:01.3180877Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:01.3180995Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:01.3181112Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:01.3181183Z ) 2025-05-07T20:32:01.3181423Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:01.3181513Z def test_silu_mul_quant( 2025-05-07T20:32:01.3181587Z self, 2025-05-07T20:32:01.3181667Z T: int, 2025-05-07T20:32:01.3181741Z D: int, 2025-05-07T20:32:01.3181839Z scale_ub: Optional[float], 2025-05-07T20:32:01.3181929Z contiguous: bool, 2025-05-07T20:32:01.3182013Z compiled: bool, 2025-05-07T20:32:01.3182095Z ) -> None: 2025-05-07T20:32:01.3182186Z torch.manual_seed(2025) 2025-05-07T20:32:01.3182255Z 2025-05-07T20:32:01.3182424Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:01.3182494Z 2025-05-07T20:32:01.3182581Z x_sign = torch.sign(x) 2025-05-07T20:32:01.3182708Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:01.3182801Z x = x_sign * x_clamp 2025-05-07T20:32:01.3182878Z x0 = x[:, :D] 2025-05-07T20:32:01.3182958Z x1 = x[:, D:] 2025-05-07T20:32:01.3183033Z 2025-05-07T20:32:01.3183113Z if contiguous: 2025-05-07T20:32:01.3183205Z x0 = x0.contiguous() 2025-05-07T20:32:01.3183293Z x1 = x1.contiguous() 2025-05-07T20:32:01.3183369Z 2025-05-07T20:32:01.3183455Z if scale_ub is not None: 2025-05-07T20:32:01.3183560Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:01.3183693Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:01.3183767Z ) 2025-05-07T20:32:01.3183847Z else: 2025-05-07T20:32:01.3183941Z scale_ub_tensor = None 2025-05-07T20:32:01.3184015Z 2025-05-07T20:32:01.3184142Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:01.3184229Z op = silu_mul_quant 2025-05-07T20:32:01.3184314Z if compiled: 2025-05-07T20:32:01.3184415Z op = torch.compile(op) 2025-05-07T20:32:01.3184516Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:01.3184588Z 2025-05-07T20:32:01.3184676Z > y_fp8, y_scale = fn() 2025-05-07T20:32:01.3184680Z 2025-05-07T20:32:01.3184774Z moe/activation_test.py:117: 2025-05-07T20:32:01.3184904Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:01.3185002Z moe/activation_test.py:115: in fn 2025-05-07T20:32:01.3185102Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:01.3185463Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/torch/_dynamo/eval_frame.py:678: in _fn 2025-05-07T20:32:01.3185552Z return fn(*args, **kwargs) 
2025-05-07T20:32:01.3186045Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:32:01.3186139Z _fbgemm_silu_mul_quant[grid]( 2025-05-07T20:32:01.3186576Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:01.3186795Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:01.3187127Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:01.3187222Z kernel = self.compile( 2025-05-07T20:32:01.3187596Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:01.3187765Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:01.3187891Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:01.3187972Z 2025-05-07T20:32:01.3188173Z self = 2025-05-07T20:32:01.3188938Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:01.3189436Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f9397a16160>} 2025-05-07T20:32:01.3190163Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:01.3190352Z context = 2025-05-07T20:32:01.3190356Z 2025-05-07T20:32:01.3190521Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:01.3190776Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:01.3190882Z module_map=module_map) 2025-05-07T20:32:01.3191046Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:01.3191144Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:32:01.3191220Z E ^ 2025-05-07T20:32:01.3191564Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:01.3191571Z 2025-05-07T20:32:01.3191975Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:01.3191979Z 2025-05-07T20:32:01.3192080Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:01.3192298Z self=, 2025-05-07T20:32:01.3192379Z T=4096, 2025-05-07T20:32:01.3192456Z D=7168, 2025-05-07T20:32:01.3192543Z scale_ub=None, 2025-05-07T20:32:01.3192627Z contiguous=False, 2025-05-07T20:32:01.3192708Z compiled=True, 2025-05-07T20:32:01.3192782Z ) 2025-05-07T20:32:01.3192999Z self = 2025-05-07T20:32:01.3193168Z T = 4096, D = 7168, scale_ub = None, contiguous = False, compiled = True 2025-05-07T20:32:01.3193172Z 2025-05-07T20:32:01.3193247Z @given( 2025-05-07T20:32:01.3193364Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:01.3193464Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:01.3193576Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:01.3193690Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:01.3193805Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:01.3193879Z ) 2025-05-07T20:32:01.3194124Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:01.3194217Z def test_silu_mul_quant( 2025-05-07T20:32:01.3194294Z self, 2025-05-07T20:32:01.3194371Z T: int, 2025-05-07T20:32:01.3194447Z D: int, 2025-05-07T20:32:01.3194627Z scale_ub: Optional[float], 2025-05-07T20:32:01.3194719Z contiguous: bool, 2025-05-07T20:32:01.3194803Z compiled: bool, 2025-05-07T20:32:01.3194881Z ) -> None: 2025-05-07T20:32:01.3194978Z torch.manual_seed(2025) 2025-05-07T20:32:01.3195047Z 2025-05-07T20:32:01.3195209Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:01.3195281Z 2025-05-07T20:32:01.3195370Z x_sign = torch.sign(x) 2025-05-07T20:32:01.3195491Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:01.3195579Z x = x_sign * x_clamp 2025-05-07T20:32:01.3195655Z x0 = x[:, :D] 2025-05-07T20:32:01.3195735Z x1 = x[:, D:] 2025-05-07T20:32:01.3195806Z 2025-05-07T20:32:01.3195965Z if contiguous: 2025-05-07T20:32:01.3196057Z x0 = x0.contiguous() 2025-05-07T20:32:01.3196142Z x1 = x1.contiguous() 2025-05-07T20:32:01.3196210Z 2025-05-07T20:32:01.3196300Z if scale_ub is not None: 2025-05-07T20:32:01.3196409Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:01.3196540Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:01.3196614Z ) 2025-05-07T20:32:01.3196688Z else: 2025-05-07T20:32:01.3196780Z scale_ub_tensor = None 2025-05-07T20:32:01.3196853Z 2025-05-07T20:32:01.3196978Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:01.3197069Z op = silu_mul_quant 2025-05-07T20:32:01.3197151Z if compiled: 2025-05-07T20:32:01.3197247Z op = torch.compile(op) 2025-05-07T20:32:01.3197351Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:01.3197420Z 2025-05-07T20:32:01.3197513Z > y_fp8, y_scale = fn() 2025-05-07T20:32:01.3197517Z 2025-05-07T20:32:01.3197613Z moe/activation_test.py:117: 2025-05-07T20:32:01.3197739Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:01.3197845Z moe/activation_test.py:115: in fn 2025-05-07T20:32:01.3197942Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:01.3198518Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/torch/_dynamo/eval_frame.py:678: in _fn 2025-05-07T20:32:01.3198654Z return fn(*args, **kwargs) 
2025-05-07T20:32:01.3199156Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:32:01.3199254Z _fbgemm_silu_mul_quant[grid]( 2025-05-07T20:32:01.3199614Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:01.3199841Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:01.3200176Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:01.3200275Z kernel = self.compile( 2025-05-07T20:32:01.3200653Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:01.3200834Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:01.3200963Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:01.3200967Z 2025-05-07T20:32:01.3201170Z self = 2025-05-07T20:32:01.3201935Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:01.3202436Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f9397a16e80>} 2025-05-07T20:32:01.3203179Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:01.3203556Z context = 2025-05-07T20:32:01.3203561Z 2025-05-07T20:32:01.3203727Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:01.3203985Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:01.3204093Z module_map=module_map) 2025-05-07T20:32:01.3204258Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:01.3204357Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:32:01.3204437Z E ^ 2025-05-07T20:32:01.3204898Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:01.3204903Z 2025-05-07T20:32:01.3205313Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:01.3205324Z 2025-05-07T20:32:01.3205432Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:01.3205651Z self=, 2025-05-07T20:32:01.3205728Z T=16384, 2025-05-07T20:32:01.3205809Z D=5120, 2025-05-07T20:32:01.3205892Z scale_ub=1200.0, 2025-05-07T20:32:01.3205977Z contiguous=False, 2025-05-07T20:32:01.3206064Z compiled=False, 2025-05-07T20:32:01.3206134Z ) 2025-05-07T20:32:01.3206346Z self = 2025-05-07T20:32:01.3206522Z T = 16384, D = 5120, scale_ub = 1200.0, contiguous = False, compiled = False 2025-05-07T20:32:01.3206526Z 2025-05-07T20:32:01.3206605Z @given( 2025-05-07T20:32:01.3206728Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:01.3206826Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:01.3206936Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:01.3207059Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:01.3207169Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:01.3207241Z ) 2025-05-07T20:32:01.3207485Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:01.3207575Z def test_silu_mul_quant( 2025-05-07T20:32:01.3207652Z self, 2025-05-07T20:32:01.3207728Z T: int, 2025-05-07T20:32:01.3207802Z D: int, 2025-05-07T20:32:01.3207901Z scale_ub: Optional[float], 2025-05-07T20:32:01.3207987Z contiguous: bool, 2025-05-07T20:32:01.3208070Z compiled: bool, 2025-05-07T20:32:01.3208153Z ) -> None: 2025-05-07T20:32:01.3208249Z torch.manual_seed(2025) 2025-05-07T20:32:01.3208320Z 2025-05-07T20:32:01.3208485Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:01.3208556Z 2025-05-07T20:32:01.3208646Z x_sign = torch.sign(x) 2025-05-07T20:32:01.3208784Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:01.3208870Z x = x_sign * x_clamp 2025-05-07T20:32:01.3208948Z x0 = x[:, :D] 2025-05-07T20:32:01.3209032Z x1 = x[:, D:] 2025-05-07T20:32:01.3209102Z 2025-05-07T20:32:01.3209188Z if contiguous: 2025-05-07T20:32:01.3209276Z x0 = x0.contiguous() 2025-05-07T20:32:01.3209364Z x1 = x1.contiguous() 2025-05-07T20:32:01.3209437Z 2025-05-07T20:32:01.3209524Z if scale_ub is not None: 2025-05-07T20:32:01.3209626Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:01.3209759Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:01.3209832Z ) 2025-05-07T20:32:01.3209914Z else: 2025-05-07T20:32:01.3210009Z scale_ub_tensor = None 2025-05-07T20:32:01.3210078Z 2025-05-07T20:32:01.3210201Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:01.3210380Z op = silu_mul_quant 2025-05-07T20:32:01.3210462Z if compiled: 2025-05-07T20:32:01.3210565Z op = torch.compile(op) 2025-05-07T20:32:01.3210666Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:01.3210735Z 2025-05-07T20:32:01.3210825Z > y_fp8, y_scale = fn() 2025-05-07T20:32:01.3210829Z 2025-05-07T20:32:01.3210923Z moe/activation_test.py:117: 2025-05-07T20:32:01.3211049Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:01.3211149Z moe/activation_test.py:115: in fn 2025-05-07T20:32:01.3211245Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:01.3211808Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 
2025-05-07T20:32:01.3211909Z _fbgemm_silu_mul_quant[grid]( 2025-05-07T20:32:01.3212260Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:01.3212485Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:01.3212818Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:01.3212908Z kernel = self.compile( 2025-05-07T20:32:01.3213288Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:01.3213456Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:01.3213585Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:01.3213590Z 2025-05-07T20:32:01.3213876Z self = 2025-05-07T20:32:01.3214636Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:01.3215140Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f93a43cc220>} 2025-05-07T20:32:01.3215868Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:01.3216064Z context = 2025-05-07T20:32:01.3216069Z 2025-05-07T20:32:01.3216230Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:01.3216487Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:01.3216596Z module_map=module_map) 2025-05-07T20:32:01.3216754Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:01.3216858Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:32:01.3216931Z E ^ 2025-05-07T20:32:01.3217276Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:01.3217280Z 2025-05-07T20:32:01.3217688Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:01.3217692Z 2025-05-07T20:32:01.3217794Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:01.3218011Z self=, 2025-05-07T20:32:01.3218085Z T=16384, 2025-05-07T20:32:01.3218160Z D=5120, 2025-05-07T20:32:01.3218243Z scale_ub=1200.0, 2025-05-07T20:32:01.3218333Z contiguous=True, 2025-05-07T20:32:01.3218412Z compiled=True, 2025-05-07T20:32:01.3218486Z ) 2025-05-07T20:32:01.3218699Z self = 2025-05-07T20:32:01.3219048Z T = 16384, D = 5120, scale_ub = 1200.0, contiguous = True, compiled = True 2025-05-07T20:32:01.3219052Z 2025-05-07T20:32:01.3219129Z @given( 2025-05-07T20:32:01.3219243Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:01.3219343Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:01.3219456Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:01.3219569Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:01.3219683Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:01.3219754Z ) 2025-05-07T20:32:01.3219995Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:01.3220088Z def test_silu_mul_quant( 2025-05-07T20:32:01.3220160Z self, 2025-05-07T20:32:01.3220880Z T: int, 2025-05-07T20:32:01.3220962Z D: int, 2025-05-07T20:32:01.3221057Z scale_ub: Optional[float], 2025-05-07T20:32:01.3221144Z contiguous: bool, 2025-05-07T20:32:01.3221236Z compiled: bool, 2025-05-07T20:32:01.3221311Z ) -> None: 2025-05-07T20:32:01.3221405Z torch.manual_seed(2025) 2025-05-07T20:32:01.3221475Z 2025-05-07T20:32:01.3221639Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:01.3221710Z 2025-05-07T20:32:01.3221798Z x_sign = torch.sign(x) 2025-05-07T20:32:01.3221921Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:01.3222010Z x = x_sign * x_clamp 2025-05-07T20:32:01.3222087Z x0 = x[:, :D] 2025-05-07T20:32:01.3222163Z x1 = x[:, D:] 2025-05-07T20:32:01.3222235Z 2025-05-07T20:32:01.3222316Z if contiguous: 2025-05-07T20:32:01.3222404Z x0 = x0.contiguous() 2025-05-07T20:32:01.3222501Z x1 = x1.contiguous() 2025-05-07T20:32:01.3222570Z 2025-05-07T20:32:01.3222657Z if scale_ub is not None: 2025-05-07T20:32:01.3222765Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:01.3222895Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:01.3222980Z ) 2025-05-07T20:32:01.3223053Z else: 2025-05-07T20:32:01.3223143Z scale_ub_tensor = None 2025-05-07T20:32:01.3223218Z 2025-05-07T20:32:01.3223342Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:01.3223431Z op = silu_mul_quant 2025-05-07T20:32:01.3223515Z if compiled: 2025-05-07T20:32:01.3223613Z op = torch.compile(op) 2025-05-07T20:32:01.3223714Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:01.3223789Z 2025-05-07T20:32:01.3223877Z > y_fp8, y_scale = fn() 2025-05-07T20:32:01.3223881Z 2025-05-07T20:32:01.3223980Z moe/activation_test.py:117: 2025-05-07T20:32:01.3224110Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:01.3224206Z moe/activation_test.py:115: in fn 2025-05-07T20:32:01.3224305Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:01.3224670Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/torch/_dynamo/eval_frame.py:678: in _fn 2025-05-07T20:32:01.3224761Z return fn(*args, **kwargs) 
2025-05-07T20:32:01.3225248Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:32:01.3225341Z _fbgemm_silu_mul_quant[grid]( 2025-05-07T20:32:01.3225694Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:01.3225912Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:01.3226250Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:01.3226346Z kernel = self.compile( 2025-05-07T20:32:01.3226723Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:01.3226977Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:01.3227104Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:01.3227109Z 2025-05-07T20:32:01.3227310Z self = 2025-05-07T20:32:01.3228073Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:01.3228639Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f93a43cd4e0>} 2025-05-07T20:32:01.3229374Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:01.3229565Z context = 2025-05-07T20:32:01.3229570Z 2025-05-07T20:32:01.3229732Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:01.3229991Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:01.3230096Z module_map=module_map) 2025-05-07T20:32:01.3230253Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:01.3230353Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:32:01.3230428Z E ^ 2025-05-07T20:32:01.3230778Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:01.3230783Z 2025-05-07T20:32:01.3231188Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:100: CompilationError
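Every CompilationError in this run has the same root cause: the Triton kernel behind silu_mul_quant requests the fp8e4nv dtype (PyTorch's torch.float8_e4m3fn), which Triton only lowers on NVIDIA GPUs of compute capability 8.9 or newer; on older parts only fp8e4b15 and fp8e5 are available, exactly as the ValueError reports, and the GPU in this job evidently predates that. A minimal sketch of a guard one could put in front of the FP8 path; fp8e4nv_supported and the (8, 9) threshold are illustrative assumptions, not FBGEMM API:

import torch

def fp8e4nv_supported() -> bool:
    """True when Triton can lower fp8e4nv (torch.float8_e4m3fn) kernels."""
    if not torch.cuda.is_available():
        return False
    # Ada (8.9) is the first NVIDIA compute capability with hardware e4m3
    # support; earlier GPUs only get Triton's fp8e4b15 / fp8e5 encodings.
    return torch.cuda.get_device_capability() >= (8, 9)

Since the test methods take self, the suite appears to be unittest-based, so the same check could back a skip decorator, e.g. @unittest.skipUnless(fp8e4nv_supported(), "fp8e4nv needs sm_89+"), turning these examples into skips rather than failures on runners like this one.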
Hypothesis went on to try eleven more parameter combinations. Each failed at the same point (the _fbgemm_silu_mul_quant[grid] launch at fbgemm_gpu/experimental/gen_ai/moe/activation.py:80, compiled through triton/runtime/jit.py:623 and triton/compiler/compiler.py:273, src.make_ir -> ast_to_ttir) with the identical CompilationError, so only the example headers are kept here; for the compiled=False examples the torch/_dynamo/eval_frame.py frame is simply absent from the traceback:

Trying example: test_silu_mul_quant(T=16384, D=5120, scale_ub=None, contiguous=False, compiled=True)
Trying example: test_silu_mul_quant(T=2048, D=5120, scale_ub=None, contiguous=False, compiled=True)
Trying example: test_silu_mul_quant(T=2048, D=5120, scale_ub=1200.0, contiguous=False, compiled=True)
Trying example: test_silu_mul_quant(T=4096, D=5120, scale_ub=1200.0, contiguous=True, compiled=True)
Trying example: test_silu_mul_quant(T=128, D=5120, scale_ub=1200.0, contiguous=False, compiled=True)
Trying example: test_silu_mul_quant(T=16384, D=7168, scale_ub=1200.0, contiguous=True, compiled=True)
Trying example: test_silu_mul_quant(T=16384, D=5120, scale_ub=1200.0, contiguous=True, compiled=False)
Trying example: test_silu_mul_quant(T=1, D=7168, scale_ub=1200.0, contiguous=False, compiled=False)
Trying example: test_silu_mul_quant(T=4096, D=7168, scale_ub=1200.0, contiguous=False, compiled=True)
Trying example: test_silu_mul_quant(T=128, D=7168, scale_ub=1200.0, contiguous=False, compiled=True)
Trying example: test_silu_mul_quant(T=2048, D=7168, scale_ub=None, contiguous=True, compiled=True)

Each attempt ended with:

E   triton.compiler.errors.CompilationError: at 1:0:
E   def _fbgemm_silu_mul_quant(
E   ^
E   ValueError("type fp8e4nv not supported in this architecture.
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:01.3374581Z 2025-05-07T20:32:01.3374985Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:01.3374990Z 2025-05-07T20:32:01.3375093Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:01.3375307Z self=, 2025-05-07T20:32:01.3375381Z T=16384, 2025-05-07T20:32:01.3375458Z D=5120, 2025-05-07T20:32:01.3375536Z scale_ub=None, 2025-05-07T20:32:01.3375625Z contiguous=False, 2025-05-07T20:32:01.3375710Z compiled=False, 2025-05-07T20:32:01.3375781Z ) 2025-05-07T20:32:01.3375993Z self = 2025-05-07T20:32:01.3376170Z T = 16384, D = 5120, scale_ub = None, contiguous = False, compiled = False 2025-05-07T20:32:01.3376175Z 2025-05-07T20:32:01.3376248Z @given( 2025-05-07T20:32:01.3376366Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:01.3376461Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:01.3376571Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:01.3376686Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:01.3376794Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:01.3376865Z ) 2025-05-07T20:32:01.3377106Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:01.3377195Z def test_silu_mul_quant( 2025-05-07T20:32:01.3377272Z self, 2025-05-07T20:32:01.3377350Z T: int, 2025-05-07T20:32:01.3377424Z D: int, 2025-05-07T20:32:01.3377521Z scale_ub: Optional[float], 2025-05-07T20:32:01.3377606Z contiguous: bool, 2025-05-07T20:32:01.3377692Z compiled: bool, 2025-05-07T20:32:01.3377770Z ) -> None: 2025-05-07T20:32:01.3377862Z torch.manual_seed(2025) 2025-05-07T20:32:01.3377932Z 2025-05-07T20:32:01.3378096Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:01.3378166Z 2025-05-07T20:32:01.3378253Z x_sign = torch.sign(x) 2025-05-07T20:32:01.3378376Z > x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:01.3380134Z E torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 320.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 144.44 MiB is free. Including non-PyTorch memory, this process has 21.92 GiB memory in use. Of the allocated memory 21.60 GiB is allocated by PyTorch, and 40.52 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. 
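Two failure modes alternate in this run. The CompilationError above is raised while Triton lowers _fbgemm_silu_mul_quant: the kernel requests the fp8e4nv element type (PyTorch's float8_e4m3fn), but this GPU only exposes fp8e4b15 and fp8e5, which indicates a compute capability below sm_89 (Ada/Hopper). Below is a minimal sketch of a capability guard a test could use to skip FP8 E4M3 cases on such hardware; the helper name is ours, not FBGEMM's:

    import torch

    def supports_fp8e4nv() -> bool:
        # fp8e4nv maps to torch.float8_e4m3fn; native support starts at
        # sm_89 (Ada) / sm_90 (Hopper). Older GPUs only offer the
        # fp8e4b15 / fp8e5 encodings listed in the error above.
        if not torch.cuda.is_available():
            return False
        return torch.cuda.get_device_capability() >= (8, 9)

    # usage sketch:
    # @unittest.skipUnless(supports_fp8e4nv(), "FP8 E4M3 needs sm_89+")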
See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables) 2025-05-07T20:32:01.3380222Z 2025-05-07T20:32:01.3380340Z moe/activation_test.py:95: OutOfMemoryError 2025-05-07T20:32:01.3380344Z 2025-05-07T20:32:01.3380443Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:01.3380665Z self=, 2025-05-07T20:32:01.3380740Z T=4096, 2025-05-07T20:32:01.3380812Z D=7168, 2025-05-07T20:32:01.3380897Z scale_ub=1200.0, 2025-05-07T20:32:01.3380978Z contiguous=True, 2025-05-07T20:32:01.3381057Z compiled=True, 2025-05-07T20:32:01.3381129Z ) 2025-05-07T20:32:01.3381340Z self = 2025-05-07T20:32:01.3381613Z T = 4096, D = 7168, scale_ub = 1200.0, contiguous = True, compiled = True 2025-05-07T20:32:01.3381618Z 2025-05-07T20:32:01.3381697Z @given( 2025-05-07T20:32:01.3381814Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:01.3381912Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:01.3382028Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:01.3382139Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:01.3382251Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:01.3382323Z ) 2025-05-07T20:32:01.3382561Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:01.3382656Z def test_silu_mul_quant( 2025-05-07T20:32:01.3382729Z self, 2025-05-07T20:32:01.3382802Z T: int, 2025-05-07T20:32:01.3382878Z D: int, 2025-05-07T20:32:01.3382972Z scale_ub: Optional[float], 2025-05-07T20:32:01.3383057Z contiguous: bool, 2025-05-07T20:32:01.3383143Z compiled: bool, 2025-05-07T20:32:01.3383224Z ) -> None: 2025-05-07T20:32:01.3383321Z torch.manual_seed(2025) 2025-05-07T20:32:01.3383390Z 2025-05-07T20:32:01.3383551Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:01.3383632Z 2025-05-07T20:32:01.3383719Z x_sign = torch.sign(x) 2025-05-07T20:32:01.3383839Z > x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:01.3385624Z E torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 112.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 32.44 MiB is free. Including non-PyTorch memory, this process has 22.03 GiB memory in use. Of the allocated memory 21.61 GiB is allocated by PyTorch, and 136.52 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. 
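For orientation, here is a plausible eager-mode reference for what the op under test computes, inferred only from the test body and the op's name (SiLU gate, elementwise multiply, FP8 quantization with an optional scale upper bound); FBGEMM's actual scaling rules may differ:

    from typing import Optional, Tuple
    import torch

    FP8_MAX = torch.finfo(torch.float8_e4m3fn).max  # 448.0

    def silu_mul_quant_ref(
        x0: torch.Tensor,
        x1: torch.Tensor,
        scale_ub: Optional[torch.Tensor] = None,
    ) -> Tuple[torch.Tensor, torch.Tensor]:
        # y = SiLU(x0) * x1 in fp32, then per-tensor quantization to E4M3.
        y = torch.nn.functional.silu(x0.float()) * x1.float()
        amax = y.abs().amax()
        if scale_ub is not None:
            amax = torch.minimum(amax, scale_ub.float())
        scale = (amax / FP8_MAX).clamp(min=1e-12)
        y_fp8 = (y / scale).clamp(-FP8_MAX, FP8_MAX).to(torch.float8_e4m3fn)
        return y_fp8, scale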
See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables) 2025-05-07T20:32:01.3385634Z 2025-05-07T20:32:01.3385747Z moe/activation_test.py:95: OutOfMemoryError 2025-05-07T20:32:01.3385752Z 2025-05-07T20:32:01.3385856Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:01.3386074Z self=, 2025-05-07T20:32:01.3386154Z T=16384, 2025-05-07T20:32:01.3386230Z D=7168, 2025-05-07T20:32:01.3386308Z scale_ub=None, 2025-05-07T20:32:01.3386391Z contiguous=False, 2025-05-07T20:32:01.3386471Z compiled=False, 2025-05-07T20:32:01.3386541Z ) 2025-05-07T20:32:01.3386753Z self = 2025-05-07T20:32:01.3386923Z T = 16384, D = 7168, scale_ub = None, contiguous = False, compiled = False 2025-05-07T20:32:01.3386927Z 2025-05-07T20:32:01.3386999Z @given( 2025-05-07T20:32:01.3387116Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:01.3387210Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:01.3387324Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:01.3387439Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:01.3387549Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:01.3387709Z ) 2025-05-07T20:32:01.3387948Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:01.3388037Z def test_silu_mul_quant( 2025-05-07T20:32:01.3388114Z self, 2025-05-07T20:32:01.3388188Z T: int, 2025-05-07T20:32:01.3388261Z D: int, 2025-05-07T20:32:01.3388362Z scale_ub: Optional[float], 2025-05-07T20:32:01.3388447Z contiguous: bool, 2025-05-07T20:32:01.3388531Z compiled: bool, 2025-05-07T20:32:01.3388608Z ) -> None: 2025-05-07T20:32:01.3388698Z torch.manual_seed(2025) 2025-05-07T20:32:01.3388768Z 2025-05-07T20:32:01.3388931Z > x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:01.3390739Z E torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 448.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 144.44 MiB is free. Including non-PyTorch memory, this process has 21.92 GiB memory in use. Of the allocated memory 21.50 GiB is allocated by PyTorch, and 136.52 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. 
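The 448.00 MiB request is exactly the input tensor itself: x has shape [T, 2*D] in bfloat16 (2 bytes per element), so for T=16384 and D=7168:

    T, D = 16384, 7168
    size_bytes = T * (2 * D) * 2      # bfloat16 = 2 bytes per element
    print(size_bytes / 2**20)         # -> 448.0 MiB, matching the log

The same arithmetic reproduces every allocation size in this section (320 MiB for T=16384/D=5120, 112 MiB for T=4096/D=7168, and so on).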
See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables) 2025-05-07T20:32:01.3390754Z 2025-05-07T20:32:01.3390869Z moe/activation_test.py:92: OutOfMemoryError 2025-05-07T20:32:01.3390873Z 2025-05-07T20:32:01.3390972Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:01.3391191Z self=, 2025-05-07T20:32:01.3391264Z T=2048, 2025-05-07T20:32:01.3391336Z D=7168, 2025-05-07T20:32:01.3391417Z scale_ub=1200.0, 2025-05-07T20:32:01.3391501Z contiguous=True, 2025-05-07T20:32:01.3391580Z compiled=True, 2025-05-07T20:32:01.3391652Z ) 2025-05-07T20:32:01.3391861Z self = 2025-05-07T20:32:01.3392023Z T = 2048, D = 7168, scale_ub = 1200.0, contiguous = True, compiled = True 2025-05-07T20:32:01.3392037Z 2025-05-07T20:32:01.3392110Z @given( 2025-05-07T20:32:01.3392223Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:01.3392321Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:01.3392431Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:01.3392542Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:01.3392654Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:01.3392725Z ) 2025-05-07T20:32:01.3392962Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:01.3393056Z def test_silu_mul_quant( 2025-05-07T20:32:01.3393130Z self, 2025-05-07T20:32:01.3393209Z T: int, 2025-05-07T20:32:01.3393285Z D: int, 2025-05-07T20:32:01.3393378Z scale_ub: Optional[float], 2025-05-07T20:32:01.3393466Z contiguous: bool, 2025-05-07T20:32:01.3393548Z compiled: bool, 2025-05-07T20:32:01.3393626Z ) -> None: 2025-05-07T20:32:01.3393719Z torch.manual_seed(2025) 2025-05-07T20:32:01.3393790Z 2025-05-07T20:32:01.3393949Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:01.3394023Z 2025-05-07T20:32:01.3394111Z x_sign = torch.sign(x) 2025-05-07T20:32:01.3394230Z > x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:01.3395955Z E torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 56.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 32.44 MiB is free. Including non-PyTorch memory, this process has 22.03 GiB memory in use. Of the allocated memory 21.67 GiB is allocated by PyTorch, and 80.52 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. 
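Note that a single example is nowhere near the 22.07 GiB capacity. Assuming the only live tensors are x, x_sign, x_clamp, and the x0/x1 copies, the peak for this example (T=2048, D=7168) is a few hundred MiB:

    T, D = 2048, 7168
    full = T * (2 * D) * 2 / 2**20    # one [T, 2*D] bf16 tensor = 56 MiB
    half = full / 2                   # x0 / x1 are [T, D] each
    print(3 * full + 2 * half)        # x, x_sign, x_clamp, x0, x1 ~= 224 MiB

So the ~21.7 GiB that PyTorch reports as already allocated must be carried over from earlier Hypothesis examples, not required by the current one.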
See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables) 2025-05-07T20:32:01.3396044Z 2025-05-07T20:32:01.3396158Z moe/activation_test.py:95: OutOfMemoryError 2025-05-07T20:32:01.3396162Z 2025-05-07T20:32:01.3396263Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:01.3396481Z self=, 2025-05-07T20:32:01.3396558Z T=2048, 2025-05-07T20:32:01.3396632Z D=7168, 2025-05-07T20:32:01.3396710Z scale_ub=None, 2025-05-07T20:32:01.3396795Z contiguous=True, 2025-05-07T20:32:01.3396875Z compiled=False, 2025-05-07T20:32:01.3396944Z ) 2025-05-07T20:32:01.3397156Z self = 2025-05-07T20:32:01.3397397Z T = 2048, D = 7168, scale_ub = None, contiguous = True, compiled = False 2025-05-07T20:32:01.3397402Z 2025-05-07T20:32:01.3397475Z @given( 2025-05-07T20:32:01.3397592Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:01.3397687Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:01.3397805Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:01.3397917Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:01.3398026Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:01.3398101Z ) 2025-05-07T20:32:01.3398777Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:01.3398895Z def test_silu_mul_quant( 2025-05-07T20:32:01.3398973Z self, 2025-05-07T20:32:01.3399047Z T: int, 2025-05-07T20:32:01.3399120Z D: int, 2025-05-07T20:32:01.3399219Z scale_ub: Optional[float], 2025-05-07T20:32:01.3399303Z contiguous: bool, 2025-05-07T20:32:01.3399385Z compiled: bool, 2025-05-07T20:32:01.3399464Z ) -> None: 2025-05-07T20:32:01.3399563Z torch.manual_seed(2025) 2025-05-07T20:32:01.3399637Z 2025-05-07T20:32:01.3399798Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:01.3399867Z 2025-05-07T20:32:01.3399963Z > x_sign = torch.sign(x) 2025-05-07T20:32:01.3401678Z E torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 56.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 32.44 MiB is free. Including non-PyTorch memory, this process has 22.03 GiB memory in use. Of the allocated memory 21.67 GiB is allocated by PyTorch, and 80.52 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. 
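Given that residue, one mitigation is to release cached GPU memory between examples. A sketch, assuming it would be called at the top of the test body (the helper name is ours):

    import gc
    import torch

    def release_cuda_memory() -> None:
        gc.collect()                # drop Python references to dead tensors
        torch.cuda.empty_cache()    # return cached blocks to the driver
        torch.cuda.synchronize()    # ensure the frees have completed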
See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables)
moe/activation_test.py:94: OutOfMemoryError

Trying example: test_silu_mul_quant( self=, T=1, D=7168, scale_ub=1200.0, contiguous=True, compiled=False, )
[... test body and kernel-launch traceback identical to the first CompilationError above; fails at moe/activation_test.py:117 in fn() ...]
E triton.compiler.errors.CompilationError: at 1:0: E def _fbgemm_silu_mul_quant( E ^ E ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')")
/home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:100: CompilationError

Trying example: test_silu_mul_quant( self=, T=128, D=5120, scale_ub=None, contiguous=True, compiled=False, )
[... test body and kernel-launch traceback identical to the first CompilationError above ...]
/home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:100: CompilationError

Trying example: test_silu_mul_quant( self=, T=128, D=7168, scale_ub=None, contiguous=True, compiled=False, )
[... test body and kernel-launch traceback identical to the first CompilationError above ...]
E triton.compiler.errors.CompilationError: at 1:0: E def _fbgemm_silu_mul_quant( E ^ E ValueError("type fp8e4nv not supported in this architecture.
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:01.3442232Z 2025-05-07T20:32:01.3442638Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:01.3442646Z 2025-05-07T20:32:01.3442749Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:01.3442968Z self=, 2025-05-07T20:32:01.3443046Z T=2048, 2025-05-07T20:32:01.3443126Z D=7168, 2025-05-07T20:32:01.3443210Z scale_ub=1200.0, 2025-05-07T20:32:01.3443296Z contiguous=True, 2025-05-07T20:32:01.3443377Z compiled=False, 2025-05-07T20:32:01.3443451Z ) 2025-05-07T20:32:01.3443667Z self = 2025-05-07T20:32:01.3443840Z T = 2048, D = 7168, scale_ub = 1200.0, contiguous = True, compiled = False 2025-05-07T20:32:01.3443844Z 2025-05-07T20:32:01.3443923Z @given( 2025-05-07T20:32:01.3444039Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:01.3444137Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:01.3444255Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:01.3444374Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:01.3444489Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:01.3444561Z ) 2025-05-07T20:32:01.3444801Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:01.3444903Z def test_silu_mul_quant( 2025-05-07T20:32:01.3444978Z self, 2025-05-07T20:32:01.3445052Z T: int, 2025-05-07T20:32:01.3445136Z D: int, 2025-05-07T20:32:01.3445234Z scale_ub: Optional[float], 2025-05-07T20:32:01.3445323Z contiguous: bool, 2025-05-07T20:32:01.3445410Z compiled: bool, 2025-05-07T20:32:01.3445486Z ) -> None: 2025-05-07T20:32:01.3445580Z torch.manual_seed(2025) 2025-05-07T20:32:01.3445656Z 2025-05-07T20:32:01.3445822Z > x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:01.3447571Z E torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 56.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 30.44 MiB is free. Including non-PyTorch memory, this process has 22.03 GiB memory in use. Of the allocated memory 21.70 GiB is allocated by PyTorch, and 53.93 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. 
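The message's own suggestion can be applied from the test environment; the setting is only read when the CUDA caching allocator initializes, so it must be in place before the first CUDA call:

    import os
    os.environ.setdefault("PYTORCH_CUDA_ALLOC_CONF", "expandable_segments:True")
    import torch  # import (and any CUDA work) only after setting the variable

Note this addresses fragmentation, not the cross-example accumulation seen above.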
See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables)
moe/activation_test.py:92: OutOfMemoryError

Trying example: test_silu_mul_quant( self=, T=1, D=5120, scale_ub=1200.0, contiguous=True, compiled=False, )
[... test body and kernel-launch traceback identical to the first CompilationError above; fails at moe/activation_test.py:117 in fn() ...]
E triton.compiler.errors.CompilationError: at 1:0: E def _fbgemm_silu_mul_quant( E ^ E ValueError("type fp8e4nv not supported in this architecture.
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:01.3459857Z 2025-05-07T20:32:01.3460262Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:01.3460266Z 2025-05-07T20:32:01.3460366Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:01.3460585Z self=, 2025-05-07T20:32:01.3460661Z T=2048, 2025-05-07T20:32:01.3460735Z D=5120, 2025-05-07T20:32:01.3460816Z scale_ub=None, 2025-05-07T20:32:01.3460897Z contiguous=True, 2025-05-07T20:32:01.3460985Z compiled=False, 2025-05-07T20:32:01.3461059Z ) 2025-05-07T20:32:01.3461270Z self = 2025-05-07T20:32:01.3461435Z T = 2048, D = 5120, scale_ub = None, contiguous = True, compiled = False 2025-05-07T20:32:01.3461440Z 2025-05-07T20:32:01.3461517Z @given( 2025-05-07T20:32:01.3461632Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:01.3461727Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:01.3461840Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:01.3461951Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:01.3462063Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:01.3462134Z ) 2025-05-07T20:32:01.3462377Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:01.3462469Z def test_silu_mul_quant( 2025-05-07T20:32:01.3462542Z self, 2025-05-07T20:32:01.3462699Z T: int, 2025-05-07T20:32:01.3462774Z D: int, 2025-05-07T20:32:01.3462869Z scale_ub: Optional[float], 2025-05-07T20:32:01.3462954Z contiguous: bool, 2025-05-07T20:32:01.3463039Z compiled: bool, 2025-05-07T20:32:01.3463113Z ) -> None: 2025-05-07T20:32:01.3463204Z torch.manual_seed(2025) 2025-05-07T20:32:01.3463276Z 2025-05-07T20:32:01.3463437Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:01.3463509Z 2025-05-07T20:32:01.3463597Z > x_sign = torch.sign(x) 2025-05-07T20:32:01.3465409Z E torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 40.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 30.44 MiB is free. Including non-PyTorch memory, this process has 22.03 GiB memory in use. Of the allocated memory 21.73 GiB is allocated by PyTorch, and 13.87 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. 
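Both compiled=True and compiled=False examples hit the same CompilationError, because silu_mul_quant launches the Triton kernel itself (activation.py:80) whether or not torch.compile wraps it. A minimal eager repro sketch; the import path is inferred from the traceback and should be treated as an assumption:

    import torch
    from fbgemm_gpu.experimental.gen_ai.moe.activation import silu_mul_quant

    x0 = torch.randn(128, 5120, device="cuda", dtype=torch.bfloat16)
    x1 = torch.randn(128, 5120, device="cuda", dtype=torch.bfloat16)
    # On a pre-sm_89 GPU this raises the same fp8e4nv CompilationError,
    # with no torch.compile frame in the stack at all:
    y_fp8, y_scale = silu_mul_quant(x0, x1, None)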
See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables) 2025-05-07T20:32:01.3465425Z 2025-05-07T20:32:01.3465539Z moe/activation_test.py:94: OutOfMemoryError 2025-05-07T20:32:01.3465543Z 2025-05-07T20:32:01.3465642Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:01.3465861Z self=, 2025-05-07T20:32:01.3465937Z T=16384, 2025-05-07T20:32:01.3466012Z D=5120, 2025-05-07T20:32:01.3466093Z scale_ub=None, 2025-05-07T20:32:01.3466175Z contiguous=True, 2025-05-07T20:32:01.3466260Z compiled=False, 2025-05-07T20:32:01.3466333Z ) 2025-05-07T20:32:01.3466549Z self = 2025-05-07T20:32:01.3466722Z T = 16384, D = 5120, scale_ub = None, contiguous = True, compiled = False 2025-05-07T20:32:01.3466726Z 2025-05-07T20:32:01.3466798Z @given( 2025-05-07T20:32:01.3466919Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:01.3467017Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:01.3467128Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:01.3467241Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:01.3467353Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:01.3467425Z ) 2025-05-07T20:32:01.3467662Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:01.3467755Z def test_silu_mul_quant( 2025-05-07T20:32:01.3467830Z self, 2025-05-07T20:32:01.3467907Z T: int, 2025-05-07T20:32:01.3467980Z D: int, 2025-05-07T20:32:01.3468075Z scale_ub: Optional[float], 2025-05-07T20:32:01.3468169Z contiguous: bool, 2025-05-07T20:32:01.3468252Z compiled: bool, 2025-05-07T20:32:01.3468329Z ) -> None: 2025-05-07T20:32:01.3468422Z torch.manual_seed(2025) 2025-05-07T20:32:01.3468491Z 2025-05-07T20:32:01.3468655Z > x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:01.3470382Z E torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 320.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 30.44 MiB is free. Including non-PyTorch memory, this process has 22.03 GiB memory in use. Of the allocated memory 21.73 GiB is allocated by PyTorch, and 13.87 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. 
See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables) 2025-05-07T20:32:01.3470388Z 2025-05-07T20:32:01.3470507Z moe/activation_test.py:92: OutOfMemoryError 2025-05-07T20:32:01.3470512Z 2025-05-07T20:32:01.3470613Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:01.3470828Z self=, 2025-05-07T20:32:01.3470986Z T=4096, 2025-05-07T20:32:01.3471061Z D=5120, 2025-05-07T20:32:01.3471139Z scale_ub=None, 2025-05-07T20:32:01.3471221Z contiguous=True, 2025-05-07T20:32:01.3471304Z compiled=False, 2025-05-07T20:32:01.3471374Z ) 2025-05-07T20:32:01.3471585Z self = 2025-05-07T20:32:01.3471753Z T = 4096, D = 5120, scale_ub = None, contiguous = True, compiled = False 2025-05-07T20:32:01.3471757Z 2025-05-07T20:32:01.3471830Z @given( 2025-05-07T20:32:01.3471942Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:01.3472039Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:01.3472150Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:01.3472408Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:01.3472522Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:01.3472592Z ) 2025-05-07T20:32:01.3472833Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:01.3472930Z def test_silu_mul_quant( 2025-05-07T20:32:01.3473002Z self, 2025-05-07T20:32:01.3473079Z T: int, 2025-05-07T20:32:01.3473151Z D: int, 2025-05-07T20:32:01.3473245Z scale_ub: Optional[float], 2025-05-07T20:32:01.3473334Z contiguous: bool, 2025-05-07T20:32:01.3473417Z compiled: bool, 2025-05-07T20:32:01.3473492Z ) -> None: 2025-05-07T20:32:01.3473586Z torch.manual_seed(2025) 2025-05-07T20:32:01.3473656Z 2025-05-07T20:32:01.3473817Z > x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:01.3475571Z E torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 80.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 30.44 MiB is free. Including non-PyTorch memory, this process has 22.03 GiB memory in use. Of the allocated memory 21.73 GiB is allocated by PyTorch, and 13.87 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. 
See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables) 2025-05-07T20:32:01.3475581Z 2025-05-07T20:32:01.3475710Z moe/activation_test.py:92: OutOfMemoryError 2025-05-07T20:32:01.3475715Z 2025-05-07T20:32:01.3475822Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:01.3476036Z self=, 2025-05-07T20:32:01.3476111Z T=2048, 2025-05-07T20:32:01.3476185Z D=5120, 2025-05-07T20:32:01.3476263Z scale_ub=None, 2025-05-07T20:32:01.3476346Z contiguous=False, 2025-05-07T20:32:01.3476426Z compiled=False, 2025-05-07T20:32:01.3476499Z ) 2025-05-07T20:32:01.3476712Z self = 2025-05-07T20:32:01.3476876Z T = 2048, D = 5120, scale_ub = None, contiguous = False, compiled = False 2025-05-07T20:32:01.3476885Z 2025-05-07T20:32:01.3476959Z @given( 2025-05-07T20:32:01.3477077Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:01.3477173Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:01.3477287Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:01.3477399Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:01.3477508Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:01.3477582Z ) 2025-05-07T20:32:01.3477819Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:01.3477908Z def test_silu_mul_quant( 2025-05-07T20:32:01.3477985Z self, 2025-05-07T20:32:01.3478058Z T: int, 2025-05-07T20:32:01.3478131Z D: int, 2025-05-07T20:32:01.3478234Z scale_ub: Optional[float], 2025-05-07T20:32:01.3478319Z contiguous: bool, 2025-05-07T20:32:01.3478402Z compiled: bool, 2025-05-07T20:32:01.3478481Z ) -> None: 2025-05-07T20:32:01.3478679Z torch.manual_seed(2025) 2025-05-07T20:32:01.3478751Z 2025-05-07T20:32:01.3478911Z > x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:01.3480634Z E torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 40.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 30.44 MiB is free. Including non-PyTorch memory, this process has 22.03 GiB memory in use. Of the allocated memory 21.73 GiB is allocated by PyTorch, and 13.87 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. 
See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables) 2025-05-07T20:32:01.3480714Z 2025-05-07T20:32:01.3480828Z moe/activation_test.py:92: OutOfMemoryError 2025-05-07T20:32:01.3480832Z 2025-05-07T20:32:01.3480933Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:01.3481150Z self=, 2025-05-07T20:32:01.3481229Z T=4096, 2025-05-07T20:32:01.3481301Z D=7168, 2025-05-07T20:32:01.3481383Z scale_ub=None, 2025-05-07T20:32:01.3481464Z contiguous=True, 2025-05-07T20:32:01.3481543Z compiled=True, 2025-05-07T20:32:01.3481616Z ) 2025-05-07T20:32:01.3481826Z self = 2025-05-07T20:32:01.3481994Z T = 4096, D = 7168, scale_ub = None, contiguous = True, compiled = True 2025-05-07T20:32:01.3481998Z 2025-05-07T20:32:01.3482071Z @given( 2025-05-07T20:32:01.3482187Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:01.3482284Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:01.3482398Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:01.3482510Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:01.3482623Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:01.3482701Z ) 2025-05-07T20:32:01.3482942Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:01.3483031Z def test_silu_mul_quant( 2025-05-07T20:32:01.3483105Z self, 2025-05-07T20:32:01.3483183Z T: int, 2025-05-07T20:32:01.3483256Z D: int, 2025-05-07T20:32:01.3483350Z scale_ub: Optional[float], 2025-05-07T20:32:01.3483438Z contiguous: bool, 2025-05-07T20:32:01.3483520Z compiled: bool, 2025-05-07T20:32:01.3483594Z ) -> None: 2025-05-07T20:32:01.3483687Z torch.manual_seed(2025) 2025-05-07T20:32:01.3483756Z 2025-05-07T20:32:01.3483917Z > x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:01.3485647Z E torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 112.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 30.44 MiB is free. Including non-PyTorch memory, this process has 22.03 GiB memory in use. Of the allocated memory 21.73 GiB is allocated by PyTorch, and 13.87 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. 
See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables) 2025-05-07T20:32:01.3485657Z 2025-05-07T20:32:01.3485772Z moe/activation_test.py:92: OutOfMemoryError 2025-05-07T20:32:01.3485776Z 2025-05-07T20:32:01.3485878Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:01.3486093Z self=, 2025-05-07T20:32:01.3486169Z T=2048, 2025-05-07T20:32:01.3486246Z D=5120, 2025-05-07T20:32:01.3486326Z scale_ub=1200.0, 2025-05-07T20:32:01.3486412Z contiguous=False, 2025-05-07T20:32:01.3486494Z compiled=False, 2025-05-07T20:32:01.3486564Z ) 2025-05-07T20:32:01.3486777Z self = 2025-05-07T20:32:01.3486946Z T = 2048, D = 5120, scale_ub = 1200.0, contiguous = False, compiled = False 2025-05-07T20:32:01.3487035Z 2025-05-07T20:32:01.3487109Z @given( 2025-05-07T20:32:01.3487226Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:01.3487322Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:01.3487434Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:01.3487548Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:01.3487656Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:01.3487729Z ) 2025-05-07T20:32:01.3487968Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:01.3488058Z def test_silu_mul_quant( 2025-05-07T20:32:01.3488135Z self, 2025-05-07T20:32:01.3488281Z T: int, 2025-05-07T20:32:01.3488356Z D: int, 2025-05-07T20:32:01.3488451Z scale_ub: Optional[float], 2025-05-07T20:32:01.3488537Z contiguous: bool, 2025-05-07T20:32:01.3488620Z compiled: bool, 2025-05-07T20:32:01.3488707Z ) -> None: 2025-05-07T20:32:01.3488795Z torch.manual_seed(2025) 2025-05-07T20:32:01.3488867Z 2025-05-07T20:32:01.3489027Z > x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:01.3490746Z E torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 40.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 30.44 MiB is free. Including non-PyTorch memory, this process has 22.03 GiB memory in use. Of the allocated memory 21.73 GiB is allocated by PyTorch, and 13.87 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. 
See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables) 2025-05-07T20:32:01.3490755Z 2025-05-07T20:32:01.3490868Z moe/activation_test.py:92: OutOfMemoryError 2025-05-07T20:32:01.3490872Z 2025-05-07T20:32:01.3490975Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:01.3491194Z self=, 2025-05-07T20:32:01.3491267Z T=4096, 2025-05-07T20:32:01.3491340Z D=7168, 2025-05-07T20:32:01.3491422Z scale_ub=1200.0, 2025-05-07T20:32:01.3491502Z contiguous=True, 2025-05-07T20:32:01.3491581Z compiled=False, 2025-05-07T20:32:01.3491655Z ) 2025-05-07T20:32:01.3491864Z self = 2025-05-07T20:32:01.3492032Z T = 4096, D = 7168, scale_ub = 1200.0, contiguous = True, compiled = False 2025-05-07T20:32:01.3492036Z 2025-05-07T20:32:01.3492109Z @given( 2025-05-07T20:32:01.3492223Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:01.3492325Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:01.3492434Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:01.3492544Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:01.3492660Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:01.3492731Z ) 2025-05-07T20:32:01.3492970Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:01.3493060Z def test_silu_mul_quant( 2025-05-07T20:32:01.3493133Z self, 2025-05-07T20:32:01.3493208Z T: int, 2025-05-07T20:32:01.3493281Z D: int, 2025-05-07T20:32:01.3493375Z scale_ub: Optional[float], 2025-05-07T20:32:01.3493462Z contiguous: bool, 2025-05-07T20:32:01.3493542Z compiled: bool, 2025-05-07T20:32:01.3493692Z ) -> None: 2025-05-07T20:32:01.3493787Z torch.manual_seed(2025) 2025-05-07T20:32:01.3493856Z 2025-05-07T20:32:01.3494024Z > x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:01.3495746Z E torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 112.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 30.44 MiB is free. Including non-PyTorch memory, this process has 22.03 GiB memory in use. Of the allocated memory 21.73 GiB is allocated by PyTorch, and 13.87 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. 
See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables) 2025-05-07T20:32:01.3495841Z 2025-05-07T20:32:01.3495953Z moe/activation_test.py:92: OutOfMemoryError 2025-05-07T20:32:01.3495958Z 2025-05-07T20:32:01.3496058Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:01.3496272Z self=, 2025-05-07T20:32:01.3496348Z T=16384, 2025-05-07T20:32:01.3496421Z D=7168, 2025-05-07T20:32:01.3496571Z scale_ub=None, 2025-05-07T20:32:01.3496659Z contiguous=False, 2025-05-07T20:32:01.3496738Z compiled=True, 2025-05-07T20:32:01.3496808Z ) 2025-05-07T20:32:01.3497025Z self = 2025-05-07T20:32:01.3497199Z T = 16384, D = 7168, scale_ub = None, contiguous = False, compiled = True 2025-05-07T20:32:01.3497203Z 2025-05-07T20:32:01.3497275Z @given( 2025-05-07T20:32:01.3497394Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:01.3497489Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:01.3497604Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:01.3497718Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:01.3497827Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:01.3497900Z ) 2025-05-07T20:32:01.3498136Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:01.3498546Z def test_silu_mul_quant( 2025-05-07T20:32:01.3498661Z self, 2025-05-07T20:32:01.3498741Z T: int, 2025-05-07T20:32:01.3498814Z D: int, 2025-05-07T20:32:01.3498911Z scale_ub: Optional[float], 2025-05-07T20:32:01.3499002Z contiguous: bool, 2025-05-07T20:32:01.3499082Z compiled: bool, 2025-05-07T20:32:01.3499161Z ) -> None: 2025-05-07T20:32:01.3499252Z torch.manual_seed(2025) 2025-05-07T20:32:01.3499325Z 2025-05-07T20:32:01.3499486Z > x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:01.3501212Z E torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 448.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 30.44 MiB is free. Including non-PyTorch memory, this process has 22.03 GiB memory in use. Of the allocated memory 21.73 GiB is allocated by PyTorch, and 13.87 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. 
See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables) 2025-05-07T20:32:01.3501222Z 2025-05-07T20:32:01.3501333Z moe/activation_test.py:92: OutOfMemoryError 2025-05-07T20:32:01.3501341Z 2025-05-07T20:32:01.3501439Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:01.3501657Z self=, 2025-05-07T20:32:01.3501730Z T=4096, 2025-05-07T20:32:01.3501805Z D=7168, 2025-05-07T20:32:01.3501887Z scale_ub=None, 2025-05-07T20:32:01.3501970Z contiguous=True, 2025-05-07T20:32:01.3502049Z compiled=False, 2025-05-07T20:32:01.3502122Z ) 2025-05-07T20:32:01.3502331Z self = 2025-05-07T20:32:01.3502497Z T = 4096, D = 7168, scale_ub = None, contiguous = True, compiled = False 2025-05-07T20:32:01.3502501Z 2025-05-07T20:32:01.3502574Z @given( 2025-05-07T20:32:01.3502692Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:01.3502792Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:01.3502903Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:01.3503185Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:01.3503298Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:01.3503368Z ) 2025-05-07T20:32:01.3503614Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:01.3503704Z def test_silu_mul_quant( 2025-05-07T20:32:01.3503778Z self, 2025-05-07T20:32:01.3503853Z T: int, 2025-05-07T20:32:01.3503928Z D: int, 2025-05-07T20:32:01.3504021Z scale_ub: Optional[float], 2025-05-07T20:32:01.3504113Z contiguous: bool, 2025-05-07T20:32:01.3504194Z compiled: bool, 2025-05-07T20:32:01.3504271Z ) -> None: 2025-05-07T20:32:01.3504365Z torch.manual_seed(2025) 2025-05-07T20:32:01.3504435Z 2025-05-07T20:32:01.3504741Z > x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:01.3506519Z E torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 112.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 30.44 MiB is free. Including non-PyTorch memory, this process has 22.03 GiB memory in use. Of the allocated memory 21.73 GiB is allocated by PyTorch, and 13.87 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. 
See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables) 2025-05-07T20:32:01.3506531Z 2025-05-07T20:32:01.3506646Z moe/activation_test.py:92: OutOfMemoryError 2025-05-07T20:32:01.3506650Z 2025-05-07T20:32:01.3506752Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:01.3506968Z self=, 2025-05-07T20:32:01.3507049Z T=16384, 2025-05-07T20:32:01.3507124Z D=7168, 2025-05-07T20:32:01.3507202Z scale_ub=None, 2025-05-07T20:32:01.3507288Z contiguous=True, 2025-05-07T20:32:01.3507368Z compiled=False, 2025-05-07T20:32:01.3507443Z ) 2025-05-07T20:32:01.3507655Z self = 2025-05-07T20:32:01.3507821Z T = 16384, D = 7168, scale_ub = None, contiguous = True, compiled = False 2025-05-07T20:32:01.3507826Z 2025-05-07T20:32:01.3507899Z @given( 2025-05-07T20:32:01.3508017Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:01.3508112Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:01.3508224Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:01.3508336Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:01.3508446Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:01.3508522Z ) 2025-05-07T20:32:01.3508765Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:01.3508857Z def test_silu_mul_quant( 2025-05-07T20:32:01.3508933Z self, 2025-05-07T20:32:01.3509006Z T: int, 2025-05-07T20:32:01.3509080Z D: int, 2025-05-07T20:32:01.3509184Z scale_ub: Optional[float], 2025-05-07T20:32:01.3509270Z contiguous: bool, 2025-05-07T20:32:01.3509353Z compiled: bool, 2025-05-07T20:32:01.3509432Z ) -> None: 2025-05-07T20:32:01.3509523Z torch.manual_seed(2025) 2025-05-07T20:32:01.3509596Z 2025-05-07T20:32:01.3509755Z > x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:01.3511477Z E torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 448.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 30.44 MiB is free. Including non-PyTorch memory, this process has 22.03 GiB memory in use. Of the allocated memory 21.73 GiB is allocated by PyTorch, and 13.87 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. 
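The allocation sizes reported in these errors follow directly from the test shapes: x is [T, 2 * D] in bfloat16, i.e. 2 bytes per element. A quick check of the arithmetic reproduces the logged numbers exactly:

    # T=4096,  D=7168: 4096  * (2*7168) * 2 bytes = 112 MiB  ("Tried to allocate 112.00 MiB")
    # T=16384, D=7168: 16384 * (2*7168) * 2 bytes = 448 MiB  ("Tried to allocate 448.00 MiB")
    D = 7168
    for T in (4096, 16384):
        print(T, D, T * (2 * D) * 2 / 2**20, "MiB")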
See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables) 2025-05-07T20:32:01.3511572Z 2025-05-07T20:32:01.3511686Z moe/activation_test.py:92: OutOfMemoryError 2025-05-07T20:32:01.3511690Z 2025-05-07T20:32:01.3511790Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:01.3512007Z self=, 2025-05-07T20:32:01.3512081Z T=16384, 2025-05-07T20:32:01.3512155Z D=7168, 2025-05-07T20:32:01.3512238Z scale_ub=1200.0, 2025-05-07T20:32:01.3512320Z contiguous=True, 2025-05-07T20:32:01.3512402Z compiled=False, 2025-05-07T20:32:01.3512478Z ) 2025-05-07T20:32:01.3512687Z self = 2025-05-07T20:32:01.3512859Z T = 16384, D = 7168, scale_ub = 1200.0, contiguous = True, compiled = False 2025-05-07T20:32:01.3512942Z 2025-05-07T20:32:01.3513018Z @given( 2025-05-07T20:32:01.3513131Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:01.3513229Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:01.3513345Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:01.3513459Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:01.3513574Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:01.3513645Z ) 2025-05-07T20:32:01.3513889Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:01.3513978Z def test_silu_mul_quant( 2025-05-07T20:32:01.3514052Z self, 2025-05-07T20:32:01.3514128Z T: int, 2025-05-07T20:32:01.3514201Z D: int, 2025-05-07T20:32:01.3514295Z scale_ub: Optional[float], 2025-05-07T20:32:01.3514383Z contiguous: bool, 2025-05-07T20:32:01.3514467Z compiled: bool, 2025-05-07T20:32:01.3514542Z ) -> None: 2025-05-07T20:32:01.3514641Z torch.manual_seed(2025) 2025-05-07T20:32:01.3514713Z 2025-05-07T20:32:01.3514875Z > x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:01.3516596Z E torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 448.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 30.44 MiB is free. Including non-PyTorch memory, this process has 22.03 GiB memory in use. Of the allocated memory 21.73 GiB is allocated by PyTorch, and 13.87 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. 
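For reference, the repeated "Trying example:" records come from Hypothesis running the @given strategies shown above at Verbosity.verbose; each record is one draw from the sampled_from strategies. A minimal self-contained sketch of the same pattern (note that _MAX_SAMPLES is a constant defined elsewhere in the test module; the value 20 below is an assumed placeholder, not the suite's real setting):

    from hypothesis import Verbosity, given, settings, strategies as st

    @given(
        T=st.sampled_from([1, 128, 2048, 4096, 16384]),
        D=st.sampled_from([5120, 7168]),
    )
    @settings(verbosity=Verbosity.verbose, max_examples=20, deadline=None)
    def test_shapes(T: int, D: int) -> None:
        # Each invocation is printed as "Trying example: test_shapes(...)".
        assert T * D > 0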
See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables) 2025-05-07T20:32:01.3516607Z 2025-05-07T20:32:01.3516721Z moe/activation_test.py:92: OutOfMemoryError 2025-05-07T20:32:01.3516725Z 2025-05-07T20:32:01.3516827Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:01.3517045Z self=, 2025-05-07T20:32:01.3517122Z T=128, 2025-05-07T20:32:01.3517195Z D=5120, 2025-05-07T20:32:01.3517275Z scale_ub=1200.0, 2025-05-07T20:32:01.3517360Z contiguous=False, 2025-05-07T20:32:01.3517446Z compiled=False, 2025-05-07T20:32:01.3517516Z ) 2025-05-07T20:32:01.3517728Z self = 2025-05-07T20:32:01.3517892Z T = 128, D = 5120, scale_ub = 1200.0, contiguous = False, compiled = False 2025-05-07T20:32:01.3517897Z 2025-05-07T20:32:01.3517970Z @given( 2025-05-07T20:32:01.3518088Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:01.3518185Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:01.3518298Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:01.3518415Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:01.3518522Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:01.3518601Z ) 2025-05-07T20:32:01.3518839Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:01.3518927Z def test_silu_mul_quant( 2025-05-07T20:32:01.3519004Z self, 2025-05-07T20:32:01.3519163Z T: int, 2025-05-07T20:32:01.3519237Z D: int, 2025-05-07T20:32:01.3519335Z scale_ub: Optional[float], 2025-05-07T20:32:01.3519421Z contiguous: bool, 2025-05-07T20:32:01.3519503Z compiled: bool, 2025-05-07T20:32:01.3519580Z ) -> None: 2025-05-07T20:32:01.3519670Z torch.manual_seed(2025) 2025-05-07T20:32:01.3519746Z 2025-05-07T20:32:01.3519906Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:01.3519976Z 2025-05-07T20:32:01.3520067Z x_sign = torch.sign(x) 2025-05-07T20:32:01.3520188Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:01.3520275Z x = x_sign * x_clamp 2025-05-07T20:32:01.3520356Z x0 = x[:, :D] 2025-05-07T20:32:01.3520510Z x1 = x[:, D:] 2025-05-07T20:32:01.3520581Z 2025-05-07T20:32:01.3520668Z if contiguous: 2025-05-07T20:32:01.3520756Z x0 = x0.contiguous() 2025-05-07T20:32:01.3520842Z x1 = x1.contiguous() 2025-05-07T20:32:01.3520921Z 2025-05-07T20:32:01.3521009Z if scale_ub is not None: 2025-05-07T20:32:01.3521114Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:01.3521246Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:01.3521319Z ) 2025-05-07T20:32:01.3521395Z else: 2025-05-07T20:32:01.3521487Z scale_ub_tensor = None 2025-05-07T20:32:01.3521556Z 2025-05-07T20:32:01.3521684Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:01.3521772Z op = silu_mul_quant 2025-05-07T20:32:01.3521853Z if compiled: 2025-05-07T20:32:01.3521951Z op = torch.compile(op) 2025-05-07T20:32:01.3522058Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:01.3522127Z 2025-05-07T20:32:01.3522220Z > y_fp8, y_scale = fn() 2025-05-07T20:32:01.3522224Z 2025-05-07T20:32:01.3522317Z moe/activation_test.py:117: 2025-05-07T20:32:01.3522454Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:01.3522551Z moe/activation_test.py:115: in fn 2025-05-07T20:32:01.3522648Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:01.3523143Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:32:01.3523236Z 
_fbgemm_silu_mul_quant[grid]( 2025-05-07T20:32:01.3523595Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:01.3523817Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:01.3524159Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:01.3524255Z kernel = self.compile( 2025-05-07T20:32:01.3524635Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:01.3524811Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:01.3524942Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:01.3524947Z 2025-05-07T20:32:01.3525148Z self = 2025-05-07T20:32:01.3525920Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:01.3526418Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f9397275bc0>} 2025-05-07T20:32:01.3527150Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:01.3527425Z context = 2025-05-07T20:32:01.3527430Z 2025-05-07T20:32:01.3527595Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:01.3527856Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:01.3527964Z module_map=module_map) 2025-05-07T20:32:01.3528126Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:01.3528228Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:32:01.3528304Z E ^ 2025-05-07T20:32:01.3528731Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:01.3528736Z 2025-05-07T20:32:01.3529146Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:01.3529156Z 2025-05-07T20:32:01.3529260Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:01.3529484Z self=, 2025-05-07T20:32:01.3529557Z T=2048, 2025-05-07T20:32:01.3529631Z D=7168, 2025-05-07T20:32:01.3529713Z scale_ub=None, 2025-05-07T20:32:01.3529795Z contiguous=False, 2025-05-07T20:32:01.3529879Z compiled=False, 2025-05-07T20:32:01.3529948Z ) 2025-05-07T20:32:01.3530160Z self = 2025-05-07T20:32:01.3530331Z T = 2048, D = 7168, scale_ub = None, contiguous = False, compiled = False 2025-05-07T20:32:01.3530335Z 2025-05-07T20:32:01.3530408Z @given( 2025-05-07T20:32:01.3530528Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:01.3530627Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:01.3530739Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:01.3530852Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:01.3530970Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:01.3531040Z ) 2025-05-07T20:32:01.3531282Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:01.3531372Z def test_silu_mul_quant( 2025-05-07T20:32:01.3531445Z self, 2025-05-07T20:32:01.3531521Z T: int, 2025-05-07T20:32:01.3531595Z D: int, 2025-05-07T20:32:01.3531690Z scale_ub: Optional[float], 2025-05-07T20:32:01.3535065Z contiguous: bool, 2025-05-07T20:32:01.3535167Z compiled: bool, 2025-05-07T20:32:01.3535250Z ) -> None: 2025-05-07T20:32:01.3535348Z torch.manual_seed(2025) 2025-05-07T20:32:01.3535420Z 2025-05-07T20:32:01.3535593Z > x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:01.3537334Z E torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 56.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 30.44 MiB is free. Including non-PyTorch memory, this process has 22.03 GiB memory in use. Of the allocated memory 21.74 GiB is allocated by PyTorch, and 5.24 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. 
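The CompilationError above ("type fp8e4nv not supported in this architecture") is a separate failure mode from the OOMs: Triton's fp8e4nv (e4m3) dtype requires compute capability 8.9 or newer, while the A10G on this g5.4xlarge runner is SM 8.6, where only 'fp8e4b15' and 'fp8e5' are available. A hedged sketch of a capability guard that would skip these cases instead of failing (the guard and its message are an assumption for illustration, not the suite's actual gating):

    import unittest
    import torch

    def supports_fp8e4nv() -> bool:
        # Triton lowers fp8e4nv only on compute capability >= (8, 9).
        return torch.cuda.is_available() and torch.cuda.get_device_capability() >= (8, 9)

    @unittest.skipIf(
        not supports_fp8e4nv(),
        "fp8e4nv needs SM 8.9+ (Ada/Hopper); this GPU only supports fp8e4b15/fp8e5",
    )
    class Fp8ActivationTests(unittest.TestCase):
        ...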
See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables) 2025-05-07T20:32:01.3537345Z 2025-05-07T20:32:01.3537462Z moe/activation_test.py:92: OutOfMemoryError 2025-05-07T20:32:01.3537466Z 2025-05-07T20:32:01.3537567Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:01.3537787Z self=, 2025-05-07T20:32:01.3537866Z T=128, 2025-05-07T20:32:01.3537945Z D=7168, 2025-05-07T20:32:01.3538025Z scale_ub=1200.0, 2025-05-07T20:32:01.3538109Z contiguous=True, 2025-05-07T20:32:01.3538188Z compiled=True, 2025-05-07T20:32:01.3538257Z ) 2025-05-07T20:32:01.3538583Z self = 2025-05-07T20:32:01.3538743Z T = 128, D = 7168, scale_ub = 1200.0, contiguous = True, compiled = True 2025-05-07T20:32:01.3538748Z 2025-05-07T20:32:01.3538824Z @given( 2025-05-07T20:32:01.3538945Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:01.3539040Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:01.3539154Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:01.3539265Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:01.3539372Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:01.3539447Z ) 2025-05-07T20:32:01.3539686Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:01.3539877Z def test_silu_mul_quant( 2025-05-07T20:32:01.3539956Z self, 2025-05-07T20:32:01.3540031Z T: int, 2025-05-07T20:32:01.3540104Z D: int, 2025-05-07T20:32:01.3540202Z scale_ub: Optional[float], 2025-05-07T20:32:01.3540295Z contiguous: bool, 2025-05-07T20:32:01.3540377Z compiled: bool, 2025-05-07T20:32:01.3540455Z ) -> None: 2025-05-07T20:32:01.3540548Z torch.manual_seed(2025) 2025-05-07T20:32:01.3540620Z 2025-05-07T20:32:01.3540783Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:01.3540852Z 2025-05-07T20:32:01.3540943Z x_sign = torch.sign(x) 2025-05-07T20:32:01.3541064Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:01.3541151Z x = x_sign * x_clamp 2025-05-07T20:32:01.3541234Z x0 = x[:, :D] 2025-05-07T20:32:01.3541311Z x1 = x[:, D:] 2025-05-07T20:32:01.3541380Z 2025-05-07T20:32:01.3541468Z if contiguous: 2025-05-07T20:32:01.3541557Z x0 = x0.contiguous() 2025-05-07T20:32:01.3541643Z x1 = x1.contiguous() 2025-05-07T20:32:01.3541715Z 2025-05-07T20:32:01.3541801Z if scale_ub is not None: 2025-05-07T20:32:01.3541914Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:01.3542044Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:01.3542117Z ) 2025-05-07T20:32:01.3542194Z else: 2025-05-07T20:32:01.3542284Z scale_ub_tensor = None 2025-05-07T20:32:01.3542353Z 2025-05-07T20:32:01.3542481Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:01.3542569Z op = silu_mul_quant 2025-05-07T20:32:01.3542651Z if compiled: 2025-05-07T20:32:01.3542750Z op = torch.compile(op) 2025-05-07T20:32:01.3542850Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:01.3542923Z 2025-05-07T20:32:01.3543014Z > y_fp8, y_scale = fn() 2025-05-07T20:32:01.3543024Z 2025-05-07T20:32:01.3543118Z moe/activation_test.py:117: 2025-05-07T20:32:01.3543249Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:01.3543346Z moe/activation_test.py:115: in fn 2025-05-07T20:32:01.3543446Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:01.3543815Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/torch/_dynamo/eval_frame.py:678: in _fn 2025-05-07T20:32:01.3543908Z return fn(*args, **kwargs) 
2025-05-07T20:32:01.3544392Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:32:01.3544494Z _fbgemm_silu_mul_quant[grid]( 2025-05-07T20:32:01.3544845Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:01.3545064Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:01.3545401Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:01.3545494Z kernel = self.compile( 2025-05-07T20:32:01.3545872Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:01.3546127Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:01.3546251Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:01.3546258Z 2025-05-07T20:32:01.3546456Z self = 2025-05-07T20:32:01.3547224Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:01.3547792Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f93971742c0>} 2025-05-07T20:32:01.3548524Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:01.3548716Z context = 2025-05-07T20:32:01.3548720Z 2025-05-07T20:32:01.3548881Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:01.3549135Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:01.3549247Z module_map=module_map) 2025-05-07T20:32:01.3549405Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:01.3549505Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:32:01.3549579Z E ^ 2025-05-07T20:32:01.3549931Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:01.3549935Z 2025-05-07T20:32:01.3550340Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:01.3550350Z 2025-05-07T20:32:01.3550454Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:01.3550671Z self=, 2025-05-07T20:32:01.3550745Z T=128, 2025-05-07T20:32:01.3550823Z D=7168, 2025-05-07T20:32:01.3550904Z scale_ub=1200.0, 2025-05-07T20:32:01.3550985Z contiguous=True, 2025-05-07T20:32:01.3551070Z compiled=False, 2025-05-07T20:32:01.3551140Z ) 2025-05-07T20:32:01.3551351Z self = 2025-05-07T20:32:01.3551515Z T = 128, D = 7168, scale_ub = 1200.0, contiguous = True, compiled = False 2025-05-07T20:32:01.3551519Z 2025-05-07T20:32:01.3551597Z @given( 2025-05-07T20:32:01.3551718Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:01.3551815Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:01.3551931Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:01.3552051Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:01.3552161Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:01.3552232Z ) 2025-05-07T20:32:01.3552472Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:01.3552563Z def test_silu_mul_quant( 2025-05-07T20:32:01.3552640Z self, 2025-05-07T20:32:01.3552715Z T: int, 2025-05-07T20:32:01.3552789Z D: int, 2025-05-07T20:32:01.3552888Z scale_ub: Optional[float], 2025-05-07T20:32:01.3552973Z contiguous: bool, 2025-05-07T20:32:01.3553055Z compiled: bool, 2025-05-07T20:32:01.3553134Z ) -> None: 2025-05-07T20:32:01.3553225Z torch.manual_seed(2025) 2025-05-07T20:32:01.3553300Z 2025-05-07T20:32:01.3553467Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:01.3553538Z 2025-05-07T20:32:01.3553626Z x_sign = torch.sign(x) 2025-05-07T20:32:01.3553834Z > x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:01.3555557Z E torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 20.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 8.44 MiB is free. Including non-PyTorch memory, this process has 22.05 GiB memory in use. Of the allocated memory 21.77 GiB is allocated by PyTorch, and 4.62 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. 
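Note how the reported free memory shrinks as the run proceeds (30.44 MiB free in the earlier examples, 8.44 MiB free here): allocations from failed examples stay referenced while Hypothesis moves on, so progressively smaller requests also fail. One mitigation sketch, assuming a unittest.TestCase like the one in the traceback; this only returns reserved-but-unallocated blocks between tests and does not fix the underlying memory pressure:

    import gc
    import unittest
    import torch

    class ActivationTestsBase(unittest.TestCase):
        def tearDown(self) -> None:
            # Drop Python references left over from a failed example, then
            # return cached CUDA blocks to the driver before the next test.
            gc.collect()
            torch.cuda.empty_cache()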
See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables) 2025-05-07T20:32:01.3555563Z 2025-05-07T20:32:01.3555679Z moe/activation_test.py:95: OutOfMemoryError 2025-05-07T20:32:01.3555758Z 2025-05-07T20:32:01.3555859Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:01.3556078Z self=, 2025-05-07T20:32:01.3556154Z T=128, 2025-05-07T20:32:01.3556234Z D=5120, 2025-05-07T20:32:01.3556315Z scale_ub=1200.0, 2025-05-07T20:32:01.3556398Z contiguous=True, 2025-05-07T20:32:01.3556478Z compiled=True, 2025-05-07T20:32:01.3556550Z ) 2025-05-07T20:32:01.3556761Z self = 2025-05-07T20:32:01.3556921Z T = 128, D = 5120, scale_ub = 1200.0, contiguous = True, compiled = True 2025-05-07T20:32:01.3556926Z 2025-05-07T20:32:01.3557002Z @given( 2025-05-07T20:32:01.3557117Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:01.3557215Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:01.3557324Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:01.3557440Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:01.3557551Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:01.3557623Z ) 2025-05-07T20:32:01.3557862Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:01.3557960Z def test_silu_mul_quant( 2025-05-07T20:32:01.3558034Z self, 2025-05-07T20:32:01.3558107Z T: int, 2025-05-07T20:32:01.3558185Z D: int, 2025-05-07T20:32:01.3558279Z scale_ub: Optional[float], 2025-05-07T20:32:01.3558363Z contiguous: bool, 2025-05-07T20:32:01.3558454Z compiled: bool, 2025-05-07T20:32:01.3558529Z ) -> None: 2025-05-07T20:32:01.3558622Z torch.manual_seed(2025) 2025-05-07T20:32:01.3558691Z 2025-05-07T20:32:01.3558850Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:01.3558925Z 2025-05-07T20:32:01.3559013Z > x_sign = torch.sign(x) 2025-05-07T20:32:01.3560730Z E torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 20.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 8.44 MiB is free. Including non-PyTorch memory, this process has 22.05 GiB memory in use. Of the allocated memory 21.77 GiB is allocated by PyTorch, and 2.12 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. 
See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables) 2025-05-07T20:32:01.3560743Z 2025-05-07T20:32:01.3560856Z moe/activation_test.py:94: OutOfMemoryError 2025-05-07T20:32:01.3560861Z 2025-05-07T20:32:01.3560957Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:01.3561175Z self=, 2025-05-07T20:32:01.3561250Z T=128, 2025-05-07T20:32:01.3561323Z D=7168, 2025-05-07T20:32:01.3561403Z scale_ub=None, 2025-05-07T20:32:01.3561484Z contiguous=True, 2025-05-07T20:32:01.3561571Z compiled=True, 2025-05-07T20:32:01.3561642Z ) 2025-05-07T20:32:01.3561853Z self = 2025-05-07T20:32:01.3562014Z T = 128, D = 7168, scale_ub = None, contiguous = True, compiled = True 2025-05-07T20:32:01.3562102Z 2025-05-07T20:32:01.3562175Z @given( 2025-05-07T20:32:01.3562288Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:01.3562387Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:01.3562498Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:01.3562610Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:01.3562721Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:01.3562792Z ) 2025-05-07T20:32:01.3563034Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:01.3563125Z def test_silu_mul_quant( 2025-05-07T20:32:01.3563198Z self, 2025-05-07T20:32:01.3563349Z T: int, 2025-05-07T20:32:01.3563425Z D: int, 2025-05-07T20:32:01.3563519Z scale_ub: Optional[float], 2025-05-07T20:32:01.3563607Z contiguous: bool, 2025-05-07T20:32:01.3563690Z compiled: bool, 2025-05-07T20:32:01.3563771Z ) -> None: 2025-05-07T20:32:01.3563864Z torch.manual_seed(2025) 2025-05-07T20:32:01.3563934Z 2025-05-07T20:32:01.3564094Z > x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:01.3565814Z E torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 20.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 8.44 MiB is free. Including non-PyTorch memory, this process has 22.05 GiB memory in use. Of the allocated memory 21.77 GiB is allocated by PyTorch, and 2.12 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables) 2025-05-07T20:32:01.3565820Z 2025-05-07T20:32:01.3565931Z moe/activation_test.py:92: OutOfMemoryError 2025-05-07T20:32:01.3566064Z =============================== warnings summary =============================== 2025-05-07T20:32:01.3566369Z ../../../../../../../../miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/autotuner.py:108 2025-05-07T20:32:01.3566667Z ../../../../../../../../miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/autotuner.py:108 2025-05-07T20:32:01.3566959Z ../../../../../../../../miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/autotuner.py:108 2025-05-07T20:32:01.3567813Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/autotuner.py:108: DeprecationWarning: warmup, rep, and use_cuda_graph parameters are deprecated. See https://github.com/triton-lang/triton/pull/4496 for details. 2025-05-07T20:32:01.3568048Z warnings.warn(("warmup, rep, and use_cuda_graph parameters are deprecated. 
See " 2025-05-07T20:32:01.3568052Z 2025-05-07T20:32:01.3568224Z experimental/gen_ai/test/moe/activation_test.py: 10 warnings 2025-05-07T20:32:01.3569460Z /home/ec2-user/actions-runner/_work/FBGEMM/FBGEMM/fbgemm_gpu/experimental/gen_ai/test/moe/activation_test.py:72: FutureWarning: `torch.testing.assert_allclose()` is deprecated since 1.12 and will be removed in a future release. Please use `torch.testing.assert_close()` instead. You can find detailed upgrade instructions in https://github.com/pytorch/pytorch/issues/61844. 2025-05-07T20:32:01.3569646Z torch.testing.assert_allclose(y, y_ref, rtol=1.6e-2, atol=1e-3) 2025-05-07T20:32:01.3569651Z 2025-05-07T20:32:01.3569857Z -- Docs: https://docs.pytest.org/en/stable/how-to/capture-warnings.html 2025-05-07T20:32:01.3570014Z ================== 1 failed, 1 passed, 13 warnings in 18.91s =================== 2025-05-07T20:32:03.1473117Z ERROR conda.cli.main_run:execute(125): `conda run python -m pytest -v -rsx -s -W ignore::pytest.PytestCollectionWarning --cache-clear ./moe/activation_test.py` failed. (See above for error) 2025-05-07T20:32:03.2098401Z 2025-05-07T20:32:03.2098828Z [TEST] Some tests FAILED. Re-attempting only FAILED tests: ./moe/activation_test.py 2025-05-07T20:32:03.2099753Z 2025-05-07T20:32:03.2099757Z 2025-05-07T20:32:03.2120424Z [EXEC] [ATTEMPT 0/2] + conda run --no-capture-output -n build_binary python -m pytest -v -rsx -s -W ignore::pytest.PytestCollectionWarning --lf --last-failed-no-failures none ./moe/activation_test.py 2025-05-07T20:32:05.3948036Z ============================= test session starts ============================== 2025-05-07T20:32:05.3948850Z platform linux -- Python 3.13.0, pytest-8.3.5, pluggy-1.5.0 -- /home/ec2-user/miniconda/envs/build_binary/bin/python 2025-05-07T20:32:05.3949376Z cachedir: .pytest_cache 2025-05-07T20:32:05.3949941Z hypothesis profile 'ci' -> database=None, deadline=None, print_blob=True, derandomize=True, suppress_health_check=(HealthCheck.too_slow,) 2025-05-07T20:32:05.3950980Z rootdir: /home/ec2-user/actions-runner/_work/FBGEMM/FBGEMM/fbgemm_gpu 2025-05-07T20:32:05.3951383Z plugins: hypothesis-6.131.14 2025-05-07T20:32:06.9480601Z TMA benchmarks will be running with experimental grid constant TMA descriptor. 2025-05-07T20:32:07.0450324Z collecting ... 
collected 2 items / 1 deselected / 1 selected 2025-05-07T20:32:07.0450868Z run-last-failure: rerun previous 1 failure 2025-05-07T20:32:07.0451164Z 2025-05-07T20:32:08.9136754Z W0507 20:32:08.911000 276022 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/0] Encountered an exception in identify_mutated_tensors, assuming every input is mutated 2025-05-07T20:32:08.9137828Z W0507 20:32:08.911000 276022 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/0] Traceback (most recent call last): 2025-05-07T20:32:08.9139179Z W0507 20:32:08.911000 276022 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/0] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/torch/_higher_order_ops/triton_kernel_wrap.py", line 731, in identify_mutated_tensors 2025-05-07T20:32:08.9140590Z W0507 20:32:08.911000 276022 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/0] ttir_module, ordered_tensor_names = generate_ttir(kernel, kwargs) 2025-05-07T20:32:08.9141568Z W0507 20:32:08.911000 276022 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/0] ~~~~~~~~~~~~~^^^^^^^^^^^^^^^^ 2025-05-07T20:32:08.9142846Z W0507 20:32:08.911000 276022 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/0] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/torch/_higher_order_ops/triton_kernel_wrap.py", line 356, in generate_ttir 2025-05-07T20:32:08.9144202Z W0507 20:32:08.911000 276022 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/0] ttir_module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:08.9145484Z W0507 20:32:08.911000 276022 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/0] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py", line 100, in make_ir 2025-05-07T20:32:08.9146831Z W0507 20:32:08.911000 276022 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/0] return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:08.9147864Z W0507 20:32:08.911000 276022 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/0] module_map=module_map) 2025-05-07T20:32:08.9149104Z W0507 20:32:08.911000 276022 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/0] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/code_generator.py", line 1298, in ast_to_ttir 2025-05-07T20:32:08.9150335Z W0507 20:32:08.911000 276022 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/0] generator.visit(fn.parse()) 2025-05-07T20:32:08.9151170Z W0507 20:32:08.911000 276022 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/0] ~~~~~~~~~~~~~~~^^^^^^^^^^^^ 2025-05-07T20:32:08.9152352Z W0507 20:32:08.911000 276022 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/0] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/code_generator.py", line 1201, in visit 2025-05-07T20:32:08.9153896Z W0507 20:32:08.911000 276022 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/0] ret = super().visit(node) 2025-05-07T20:32:08.9154921Z W0507 20:32:08.911000 276022 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/0] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.13/ast.py", line 428, in visit 2025-05-07T20:32:08.9155925Z W0507 
20:32:08.911000 276022 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/0] return visitor(node) 2025-05-07T20:32:08.9157270Z W0507 20:32:08.911000 276022 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/0] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/code_generator.py", line 352, in visit_Module 2025-05-07T20:32:08.9158529Z W0507 20:32:08.911000 276022 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/0] ast.NodeVisitor.generic_visit(self, node) 2025-05-07T20:32:08.9159421Z W0507 20:32:08.911000 276022 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/0] ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~^^^^^^^^^^^^ 2025-05-07T20:32:08.9160492Z W0507 20:32:08.911000 276022 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/0] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.13/ast.py", line 436, in generic_visit 2025-05-07T20:32:08.9161518Z W0507 20:32:08.911000 276022 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/0] self.visit(item) 2025-05-07T20:32:08.9162282Z W0507 20:32:08.911000 276022 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/0] ~~~~~~~~~~^^^^^^ 2025-05-07T20:32:08.9163438Z W0507 20:32:08.911000 276022 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/0] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/code_generator.py", line 1207, in visit 2025-05-07T20:32:08.9164775Z W0507 20:32:08.911000 276022 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/0] raise CompilationError(self.jit_fn.src, self.cur_node, repr(e)) from None 2025-05-07T20:32:08.9165825Z W0507 20:32:08.911000 276022 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/0] triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:08.9166733Z W0507 20:32:08.911000 276022 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/0] def _fbgemm_silu_mul_quant( 2025-05-07T20:32:08.9167467Z W0507 20:32:08.911000 276022 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/0] ^ 2025-05-07T20:32:08.9168472Z W0507 20:32:08.911000 276022 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/0] ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:08.9308021Z W0507 20:32:08.929000 276022 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/0] Encountered an exception in identify_mutated_tensors, assuming every input is mutated 2025-05-07T20:32:08.9309058Z W0507 20:32:08.929000 276022 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/0] Traceback (most recent call last): 2025-05-07T20:32:08.9310369Z W0507 20:32:08.929000 276022 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/0] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/torch/_higher_order_ops/triton_kernel_wrap.py", line 731, in identify_mutated_tensors 2025-05-07T20:32:08.9311779Z W0507 20:32:08.929000 276022 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/0] ttir_module, ordered_tensor_names = generate_ttir(kernel, kwargs) 2025-05-07T20:32:08.9312741Z W0507 20:32:08.929000 276022 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/0] ~~~~~~~~~~~~~^^^^^^^^^^^^^^^^ 2025-05-07T20:32:08.9314269Z W0507 20:32:08.929000 276022 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/0] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/torch/_higher_order_ops/triton_kernel_wrap.py", line 356, in generate_ttir 2025-05-07T20:32:08.9315626Z W0507 20:32:08.929000 276022 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/0] ttir_module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:08.9317032Z W0507 20:32:08.929000 276022 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/0] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py", line 100, in make_ir 2025-05-07T20:32:08.9318378Z W0507 20:32:08.929000 276022 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/0] return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:08.9319405Z W0507 20:32:08.929000 276022 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/0] module_map=module_map) 2025-05-07T20:32:08.9320642Z W0507 20:32:08.929000 276022 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/0] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/code_generator.py", line 1298, in ast_to_ttir 2025-05-07T20:32:08.9321864Z W0507 20:32:08.929000 276022 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/0] generator.visit(fn.parse()) 2025-05-07T20:32:08.9322698Z W0507 20:32:08.929000 276022 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/0] ~~~~~~~~~~~~~~~^^^^^^^^^^^^ 2025-05-07T20:32:08.9323887Z W0507 20:32:08.929000 276022 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/0] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/code_generator.py", line 1201, in visit 2025-05-07T20:32:08.9325098Z W0507 20:32:08.929000 276022 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/0] ret = super().visit(node) 2025-05-07T20:32:08.9326107Z W0507 20:32:08.929000 276022 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/0] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.13/ast.py", line 428, in visit 2025-05-07T20:32:08.9327108Z W0507 20:32:08.929000 276022 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/0] return 
visitor(node) 2025-05-07T20:32:08.9328303Z W0507 20:32:08.929000 276022 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/0] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/code_generator.py", line 352, in visit_Module 2025-05-07T20:32:08.9329562Z W0507 20:32:08.929000 276022 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/0] ast.NodeVisitor.generic_visit(self, node) 2025-05-07T20:32:08.9330460Z W0507 20:32:08.929000 276022 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/0] ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~^^^^^^^^^^^^ 2025-05-07T20:32:08.9331521Z W0507 20:32:08.929000 276022 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/0] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.13/ast.py", line 436, in generic_visit 2025-05-07T20:32:08.9332545Z W0507 20:32:08.929000 276022 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/0] self.visit(item) 2025-05-07T20:32:08.9333305Z W0507 20:32:08.929000 276022 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/0] ~~~~~~~~~~^^^^^^ 2025-05-07T20:32:08.9334559Z W0507 20:32:08.929000 276022 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/0] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/code_generator.py", line 1207, in visit 2025-05-07T20:32:08.9335882Z W0507 20:32:08.929000 276022 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/0] raise CompilationError(self.jit_fn.src, self.cur_node, repr(e)) from None 2025-05-07T20:32:08.9337021Z W0507 20:32:08.929000 276022 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/0] triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:08.9337918Z W0507 20:32:08.929000 276022 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/0] def _fbgemm_silu_mul_quant( 2025-05-07T20:32:08.9338653Z W0507 20:32:08.929000 276022 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/0] ^ 2025-05-07T20:32:08.9339735Z W0507 20:32:08.929000 276022 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/0] ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:09.3313459Z moe/activation_test.py::ActivationTests::test_silu_mul_quant Trying example: test_silu_mul_quant( 2025-05-07T20:32:09.3314381Z self=, 2025-05-07T20:32:09.3314865Z T=1, 2025-05-07T20:32:09.3315063Z D=5120, 2025-05-07T20:32:09.3315267Z scale_ub=None, 2025-05-07T20:32:09.3315481Z contiguous=True, 2025-05-07T20:32:09.3315711Z compiled=True, 2025-05-07T20:32:09.3315930Z ) 2025-05-07T20:32:09.3316251Z self = 2025-05-07T20:32:09.3316742Z T = 1, D = 5120, scale_ub = None, contiguous = True, compiled = True 2025-05-07T20:32:09.3317001Z 2025-05-07T20:32:09.3317096Z @given( 2025-05-07T20:32:09.3317329Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:09.3317674Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:09.3318023Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:09.3318359Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:09.3318683Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:09.3318973Z ) 2025-05-07T20:32:09.3319331Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:09.3319771Z def test_silu_mul_quant( 2025-05-07T20:32:09.3320018Z self, 2025-05-07T20:32:09.3320218Z T: int, 2025-05-07T20:32:09.3320413Z D: int, 2025-05-07T20:32:09.3320639Z scale_ub: Optional[float], 2025-05-07T20:32:09.3320914Z contiguous: bool, 2025-05-07T20:32:09.3321151Z compiled: bool, 2025-05-07T20:32:09.3321387Z ) -> None: 2025-05-07T20:32:09.3321607Z torch.manual_seed(2025) 2025-05-07T20:32:09.3321844Z 2025-05-07T20:32:09.3322123Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:09.3322474Z 2025-05-07T20:32:09.3322677Z x_sign = torch.sign(x) 2025-05-07T20:32:09.3322966Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:09.3323280Z x = x_sign * x_clamp 2025-05-07T20:32:09.3323526Z x0 = x[:, :D] 2025-05-07T20:32:09.3323748Z x1 = x[:, D:] 2025-05-07T20:32:09.3323957Z 2025-05-07T20:32:09.3324148Z if contiguous: 2025-05-07T20:32:09.3324381Z x0 = x0.contiguous() 2025-05-07T20:32:09.3324644Z x1 = x1.contiguous() 2025-05-07T20:32:09.3324886Z 2025-05-07T20:32:09.3325077Z if scale_ub is not None: 2025-05-07T20:32:09.3325352Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:09.3325687Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:09.3325989Z ) 2025-05-07T20:32:09.3326185Z else: 2025-05-07T20:32:09.3326398Z scale_ub_tensor = None 2025-05-07T20:32:09.3326650Z 2025-05-07T20:32:09.3326884Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:09.3327209Z op = silu_mul_quant 2025-05-07T20:32:09.3327467Z if compiled: 2025-05-07T20:32:09.3327711Z op = torch.compile(op) 2025-05-07T20:32:09.3328009Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:09.3328660Z 2025-05-07T20:32:09.3328850Z y_fp8, y_scale = fn() 2025-05-07T20:32:09.3329136Z y = y_fp8.to(torch.float32) * y_scale[:, None] 2025-05-07T20:32:09.3329434Z 2025-05-07T20:32:09.3329666Z def ref_fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:09.3330002Z x0_fp32 = x0.to(torch.float32) 2025-05-07T20:32:09.3330303Z x1_fp32 = x1.to(torch.float32) 2025-05-07T20:32:09.3330611Z y = x0_fp32 * torch.sigmoid(x0_fp32) * x1_fp32 2025-05-07T20:32:09.3330973Z return triton_quantize_fp8_row(y, scale_ub_tensor) 2025-05-07T20:32:09.3331288Z 2025-05-07T20:32:09.3331486Z > y_fp8_ref, y_scale_ref = ref_fn() 2025-05-07T20:32:09.3331688Z 2025-05-07T20:32:09.3331995Z moe/activation_test.py:126: 2025-05-07T20:32:09.3332298Z _ _ _ _ _ _ _ _ 
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:09.3332635Z moe/activation_test.py:124: in ref_fn 2025-05-07T20:32:09.3332963Z return triton_quantize_fp8_row(y, scale_ub_tensor) 2025-05-07T20:32:09.3333897Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/fbgemm_gpu/experimental/gemm/triton_gemm/fp8_gemm.py:2370: in triton_quantize_fp8_row 2025-05-07T20:32:09.3334642Z _kernel_quantize_fp8_row[grid]( 2025-05-07T20:32:09.3335186Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:09.3335860Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:09.3336547Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/autotuner.py:186: in run 2025-05-07T20:32:09.3337271Z timings = {config: self._bench(*args, config=config, **kwargs) for config in pruned_configs} 2025-05-07T20:32:09.3337982Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/autotuner.py:166: in _bench 2025-05-07T20:32:09.3338616Z return self.do_bench(kernel_call, quantiles=(0.5, 0.2, 0.8)) 2025-05-07T20:32:09.3339228Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/testing.py:117: in do_bench 2025-05-07T20:32:09.3339742Z fn() 2025-05-07T20:32:09.3340242Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/autotuner.py:152: in kernel_call 2025-05-07T20:32:09.3340825Z self.fn.run( 2025-05-07T20:32:09.3341293Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:09.3341811Z kernel = self.compile( 2025-05-07T20:32:09.3342351Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:09.3343009Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:09.3343407Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:09.3343631Z 2025-05-07T20:32:09.3343837Z self = 2025-05-07T20:32:09.3344914Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=2, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:09.3346292Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7fce612f36a0>} 2025-05-07T20:32:09.3347624Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:09.3348645Z context = 2025-05-07T20:32:09.3348930Z 2025-05-07T20:32:09.3349099Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:09.3349709Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:09.3350177Z module_map=module_map) 2025-05-07T20:32:09.3350542Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:09.3350902Z E def _kernel_quantize_fp8_row( 2025-05-07T20:32:09.3351175Z E ^ 2025-05-07T20:32:09.3351640Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:09.3352079Z 2025-05-07T20:32:09.3352490Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:09.3353002Z 2025-05-07T20:32:09.3353195Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:09.3353613Z self=, 2025-05-07T20:32:09.3354021Z T=2048, 2025-05-07T20:32:09.3354212Z D=5120, 2025-05-07T20:32:09.3354420Z scale_ub=1200.0, 2025-05-07T20:32:09.3354646Z contiguous=True, 2025-05-07T20:32:09.3354867Z compiled=False, 2025-05-07T20:32:09.3355079Z ) 2025-05-07T20:32:09.3355401Z self = 2025-05-07T20:32:09.3355885Z T = 2048, D = 5120, scale_ub = 1200.0, contiguous = True, compiled = False 2025-05-07T20:32:09.3356163Z 2025-05-07T20:32:09.3356245Z @given( 2025-05-07T20:32:09.3356476Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:09.3356787Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:09.3357098Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:09.3357431Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:09.3357808Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:09.3358096Z ) 2025-05-07T20:32:09.3358448Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:09.3358889Z def test_silu_mul_quant( 2025-05-07T20:32:09.3359131Z self, 2025-05-07T20:32:09.3359330Z T: int, 2025-05-07T20:32:09.3359527Z D: int, 2025-05-07T20:32:09.3359743Z scale_ub: Optional[float], 2025-05-07T20:32:09.3360018Z contiguous: bool, 2025-05-07T20:32:09.3360261Z compiled: bool, 2025-05-07T20:32:09.3360479Z ) -> None: 2025-05-07T20:32:09.3360697Z torch.manual_seed(2025) 2025-05-07T20:32:09.3360940Z 2025-05-07T20:32:09.3361206Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:09.3361551Z 2025-05-07T20:32:09.3361752Z x_sign = torch.sign(x) 2025-05-07T20:32:09.3362037Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:09.3362359Z x = x_sign * x_clamp 2025-05-07T20:32:09.3362601Z x0 = x[:, :D] 2025-05-07T20:32:09.3362815Z x1 = x[:, D:] 2025-05-07T20:32:09.3363023Z 2025-05-07T20:32:09.3363211Z if contiguous: 2025-05-07T20:32:09.3363445Z x0 = x0.contiguous() 2025-05-07T20:32:09.3363703Z x1 = x1.contiguous() 2025-05-07T20:32:09.3363946Z 2025-05-07T20:32:09.3364140Z if scale_ub is not None: 2025-05-07T20:32:09.3364408Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:09.3364743Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:09.3365054Z ) 2025-05-07T20:32:09.3365246Z else: 2025-05-07T20:32:09.3365466Z scale_ub_tensor = None 2025-05-07T20:32:09.3365725Z 2025-05-07T20:32:09.3365952Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:09.3366272Z op = silu_mul_quant 2025-05-07T20:32:09.3366526Z if compiled: 2025-05-07T20:32:09.3366775Z op = torch.compile(op) 2025-05-07T20:32:09.3367079Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:09.3367358Z 2025-05-07T20:32:09.3367550Z > y_fp8, y_scale = fn() 2025-05-07T20:32:09.3367722Z 2025-05-07T20:32:09.3367914Z moe/activation_test.py:117: 2025-05-07T20:32:09.3368210Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:09.3368544Z moe/activation_test.py:115: in fn 2025-05-07T20:32:09.3368822Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:09.3369507Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:32:09.3370193Z 
_fbgemm_silu_mul_quant[grid]( 2025-05-07T20:32:09.3370722Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:09.3371399Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:09.3372175Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:09.3372707Z kernel = self.compile( 2025-05-07T20:32:09.3373238Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:09.3374025Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:09.3374426Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:09.3374651Z 2025-05-07T20:32:09.3374867Z self = 2025-05-07T20:32:09.3375922Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:09.3377277Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7fce60f61f80>} 2025-05-07T20:32:09.3378658Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:09.3386152Z context = 2025-05-07T20:32:09.3386451Z 2025-05-07T20:32:09.3386629Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:09.3387149Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:09.3387625Z module_map=module_map) 2025-05-07T20:32:09.3388002Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:09.3388355Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:32:09.3388627Z E ^ 2025-05-07T20:32:09.3389108Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')")
2025-05-07T20:32:09.3389983Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:100: CompilationError
2025-05-07T20:32:09.7259612Z W0507 20:32:09.721000 276022 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/1] Encountered an exception in identify_mutated_tensors, assuming every input is mutated
[... the same fp8e4nv CompilationError traceback as above is logged twice under [0/1]; duplicate tracebacks elided ...]
2025-05-07T20:32:10.3822735Z Trying example: test_silu_mul_quant(
2025-05-07T20:32:10.3823445Z self=,
2025-05-07T20:32:10.3824004Z T=2048,
2025-05-07T20:32:10.3824299Z D=5120,
2025-05-07T20:32:10.3824503Z scale_ub=1200.0,
2025-05-07T20:32:10.3824736Z contiguous=True,
2025-05-07T20:32:10.3824957Z compiled=True,
2025-05-07T20:32:10.3825166Z )
[... test source listing elided; identical to the listing printed for the next example below ...]
2025-05-07T20:32:10.3840729Z >       y_fp8_ref, y_scale_ref = ref_fn()
2025-05-07T20:32:10.3841020Z moe/activation_test.py:126:
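[Note: every failure in this job has the same root cause. Triton's fp8e4nv corresponds to torch.float8_e4m3fn, and Triton only lowers it on GPUs of compute capability sm_89 or newer (Ada/Hopper); this runner is a g5.4xlarge, whose NVIDIA A10G is sm_86, hence "type fp8e4nv not supported in this architecture". A minimal capability guard along the following lines (a sketch; cuda_supports_fp8e4nv is a hypothetical helper, not part of moe/activation_test.py) would let such tests skip instead of erroring on pre-sm_89 runners:

    import unittest
    import torch

    def cuda_supports_fp8e4nv() -> bool:
        # fp8e4nv (torch.float8_e4m3fn) lowering in Triton requires sm_89+.
        if not torch.cuda.is_available():
            return False
        return torch.cuda.get_device_capability() >= (8, 9)  # A10G reports (8, 6)

    # usage sketch:
    # @unittest.skipUnless(cuda_supports_fp8e4nv(), "fp8e4nv requires sm_89+")
    # def test_silu_mul_quant(self, ...): ...
]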
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:10.3841654Z moe/activation_test.py:124: in ref_fn 2025-05-07T20:32:10.3841975Z return triton_quantize_fp8_row(y, scale_ub_tensor) 2025-05-07T20:32:10.3842758Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/fbgemm_gpu/experimental/gemm/triton_gemm/fp8_gemm.py:2370: in triton_quantize_fp8_row 2025-05-07T20:32:10.3843505Z _kernel_quantize_fp8_row[grid]( 2025-05-07T20:32:10.3844053Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:10.3844722Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:10.3845398Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/autotuner.py:186: in run 2025-05-07T20:32:10.3846112Z timings = {config: self._bench(*args, config=config, **kwargs) for config in pruned_configs} 2025-05-07T20:32:10.3846823Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/autotuner.py:166: in _bench 2025-05-07T20:32:10.3847450Z return self.do_bench(kernel_call, quantiles=(0.5, 0.2, 0.8)) 2025-05-07T20:32:10.3848046Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/testing.py:117: in do_bench 2025-05-07T20:32:10.3848646Z fn() 2025-05-07T20:32:10.3849138Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/autotuner.py:152: in kernel_call 2025-05-07T20:32:10.3849716Z self.fn.run( 2025-05-07T20:32:10.3850178Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:10.3850699Z kernel = self.compile( 2025-05-07T20:32:10.3851225Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:10.3851867Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:10.3852265Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:10.3852495Z 2025-05-07T20:32:10.3852776Z self = 2025-05-07T20:32:10.3854002Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=2, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:10.3855380Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7fce60bf07c0>} 2025-05-07T20:32:10.3856700Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:10.3857709Z context = 2025-05-07T20:32:10.3857992Z 2025-05-07T20:32:10.3858165Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:10.3858677Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:10.3859140Z module_map=module_map) 2025-05-07T20:32:10.3859507Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:10.3859860Z E def _kernel_quantize_fp8_row( 2025-05-07T20:32:10.3860128Z E ^ 2025-05-07T20:32:10.3860584Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:10.3861021Z 2025-05-07T20:32:10.3861427Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:10.3861936Z 2025-05-07T20:32:10.3862042Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:10.3862452Z self=, 2025-05-07T20:32:10.3862849Z T=16384, 2025-05-07T20:32:10.3863043Z D=7168, 2025-05-07T20:32:10.3863236Z scale_ub=1200.0, 2025-05-07T20:32:10.3863458Z contiguous=False, 2025-05-07T20:32:10.3863681Z compiled=False, 2025-05-07T20:32:10.3863887Z ) 2025-05-07T20:32:10.3864203Z self = 2025-05-07T20:32:10.3864693Z T = 16384, D = 7168, scale_ub = 1200.0, contiguous = False, compiled = False 2025-05-07T20:32:10.3864969Z 2025-05-07T20:32:10.3865065Z @given( 2025-05-07T20:32:10.3865294Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:10.3865609Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:10.3865913Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:10.3866236Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:10.3866562Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:10.3866846Z ) 2025-05-07T20:32:10.3867194Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:10.3867634Z def test_silu_mul_quant( 2025-05-07T20:32:10.3867877Z self, 2025-05-07T20:32:10.3868099Z T: int, 2025-05-07T20:32:10.3868320Z D: int, 2025-05-07T20:32:10.3868540Z scale_ub: Optional[float], 2025-05-07T20:32:10.3869093Z contiguous: bool, 2025-05-07T20:32:10.3869323Z compiled: bool, 2025-05-07T20:32:10.3869547Z ) -> None: 2025-05-07T20:32:10.3869761Z torch.manual_seed(2025) 2025-05-07T20:32:10.3869998Z 2025-05-07T20:32:10.3870268Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:10.3870605Z 2025-05-07T20:32:10.3870792Z x_sign = torch.sign(x) 2025-05-07T20:32:10.3871078Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:10.3871383Z x = x_sign * x_clamp 2025-05-07T20:32:10.3871613Z x0 = x[:, :D] 2025-05-07T20:32:10.3871829Z x1 = x[:, D:] 2025-05-07T20:32:10.3872035Z 2025-05-07T20:32:10.3872212Z if contiguous: 2025-05-07T20:32:10.3872523Z x0 = x0.contiguous() 2025-05-07T20:32:10.3872778Z x1 = x1.contiguous() 2025-05-07T20:32:10.3873014Z 2025-05-07T20:32:10.3873199Z if scale_ub is not None: 2025-05-07T20:32:10.3873478Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:10.3873814Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:10.3874115Z ) 2025-05-07T20:32:10.3874307Z else: 2025-05-07T20:32:10.3874516Z scale_ub_tensor = None 2025-05-07T20:32:10.3874757Z 2025-05-07T20:32:10.3874984Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:10.3875298Z op = silu_mul_quant 2025-05-07T20:32:10.3875539Z if compiled: 2025-05-07T20:32:10.3875790Z op = torch.compile(op) 2025-05-07T20:32:10.3876087Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:10.3876355Z 2025-05-07T20:32:10.3876549Z > y_fp8, y_scale = fn() 2025-05-07T20:32:10.3876716Z 2025-05-07T20:32:10.3876821Z moe/activation_test.py:117: 2025-05-07T20:32:10.3877114Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:10.3877439Z moe/activation_test.py:115: in fn 2025-05-07T20:32:10.3877725Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:10.3878404Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 
2025-05-07T20:32:10.3879079Z _fbgemm_silu_mul_quant[grid]( 2025-05-07T20:32:10.3879611Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:10.3880286Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:10.3880942Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:10.3881466Z kernel = self.compile( 2025-05-07T20:32:10.3882009Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:10.3882660Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:10.3883052Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:10.3883286Z 2025-05-07T20:32:10.3883492Z self = 2025-05-07T20:32:10.3884555Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:10.3885902Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7fce60bd5440>} 2025-05-07T20:32:10.3887229Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:10.3888231Z context = 2025-05-07T20:32:10.3888599Z 2025-05-07T20:32:10.3888763Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:10.3889275Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:10.3889741Z module_map=module_map) 2025-05-07T20:32:10.3890097Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:10.3890448Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:32:10.3890707Z E ^ 2025-05-07T20:32:10.3891151Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')")
2025-05-07T20:32:10.3892099Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:100: CompilationError
2025-05-07T20:32:10.6106404Z W0507 20:32:10.606000 276022 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/2] Encountered an exception in identify_mutated_tensors, assuming every input is mutated
[... the same fp8e4nv CompilationError traceback is logged twice under [0/2]; duplicate tracebacks elided ...]
2025-05-07T20:32:11.1147774Z Trying example: test_silu_mul_quant(
2025-05-07T20:32:11.1148306Z self=,
2025-05-07T20:32:11.1148716Z T=1,
2025-05-07T20:32:11.1148939Z D=7168,
2025-05-07T20:32:11.1149509Z scale_ub=None,
2025-05-07T20:32:11.1149730Z contiguous=True,
2025-05-07T20:32:11.1149961Z compiled=True,
2025-05-07T20:32:11.1150171Z )
[... identical test source listing elided; ref_fn() fails in triton_quantize_fp8_row -> _kernel_quantize_fp8_row with the same fp8e4nv CompilationError ...]
2025-05-07T20:32:11.1193017Z Trying example: test_silu_mul_quant(
2025-05-07T20:32:11.1193424Z self=,
2025-05-07T20:32:11.1193814Z T=4096,
2025-05-07T20:32:11.1194002Z D=5120,
2025-05-07T20:32:11.1194203Z scale_ub=None,
2025-05-07T20:32:11.1194419Z contiguous=False,
2025-05-07T20:32:11.1194646Z compiled=False,
2025-05-07T20:32:11.1194854Z )
[... identical test source listing elided; fn() fails in silu_mul_quant -> _fbgemm_silu_mul_quant with the same fp8e4nv CompilationError ...]
2025-05-07T20:32:11.3998778Z W0507 20:32:11.396000 276022 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/3] Encountered an exception in identify_mutated_tensors, assuming every input is mutated
[... the same fp8e4nv CompilationError traceback is logged twice under [0/3]; duplicate tracebacks elided ...]
2025-05-07T20:32:12.1094086Z Trying example: test_silu_mul_quant(
2025-05-07T20:32:12.1094737Z self=,
2025-05-07T20:32:12.1095226Z T=4096,
2025-05-07T20:32:12.1095422Z D=7168,
2025-05-07T20:32:12.1095615Z scale_ub=None,
2025-05-07T20:32:12.1095827Z contiguous=False,
2025-05-07T20:32:12.1096052Z compiled=False,
2025-05-07T20:32:12.1096266Z )
[... identical test source listing elided; fn() fails in silu_mul_quant -> _fbgemm_silu_mul_quant with the same fp8e4nv CompilationError ...]
2025-05-07T20:32:12.1125682Z Trying example: test_silu_mul_quant(
2025-05-07T20:32:12.1126180Z self=,
2025-05-07T20:32:12.1126574Z T=128,
2025-05-07T20:32:12.1126760Z D=7168,
2025-05-07T20:32:12.1126958Z scale_ub=None,
2025-05-07T20:32:12.1127178Z contiguous=False,
2025-05-07T20:32:12.1127401Z compiled=True,
2025-05-07T20:32:12.1127616Z )
[... identical test source listing elided; ref_fn() fails in triton_quantize_fp8_row -> _kernel_quantize_fp8_row; the log is truncated mid-traceback at self.fn.run( ...]
/home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:12.1152670Z kernel = self.compile( 2025-05-07T20:32:12.1153195Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:12.1153837Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:12.1154231Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:12.1154454Z 2025-05-07T20:32:12.1154662Z self = 2025-05-07T20:32:12.1155722Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=2, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:12.1157077Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7fce5ba7a020>} 2025-05-07T20:32:12.1158402Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:12.1159405Z context = 2025-05-07T20:32:12.1159692Z 2025-05-07T20:32:12.1159853Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:12.1160374Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:12.1160837Z module_map=module_map) 2025-05-07T20:32:12.1161202Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:12.1161642Z E def _kernel_quantize_fp8_row( 2025-05-07T20:32:12.1161910Z E ^ 2025-05-07T20:32:12.1162380Z E ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:12.1169215Z 2025-05-07T20:32:12.1169662Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:12.3544616Z 2025-05-07T20:32:12.3545289Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:12.3545975Z self=, 2025-05-07T20:32:12.3546536Z T=128, 2025-05-07T20:32:12.3546741Z D=7168, 2025-05-07T20:32:12.3546947Z scale_ub=None, 2025-05-07T20:32:12.3547557Z contiguous=False, 2025-05-07T20:32:12.3547795Z compiled=False, 2025-05-07T20:32:12.3548019Z ) 2025-05-07T20:32:12.3548375Z self = 2025-05-07T20:32:12.3548905Z T = 128, D = 7168, scale_ub = None, contiguous = False, compiled = False 2025-05-07T20:32:12.3549182Z 2025-05-07T20:32:12.3549265Z @given( 2025-05-07T20:32:12.3549509Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:12.3549836Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:12.3550145Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:12.3550484Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:12.3550824Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:12.3551115Z ) 2025-05-07T20:32:12.3551473Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:12.3551924Z def test_silu_mul_quant( 2025-05-07T20:32:12.3552174Z self, 2025-05-07T20:32:12.3552380Z T: int, 2025-05-07T20:32:12.3552593Z D: int, 2025-05-07T20:32:12.3552814Z scale_ub: Optional[float], 2025-05-07T20:32:12.3553099Z contiguous: bool, 2025-05-07T20:32:12.3553353Z compiled: bool, 2025-05-07T20:32:12.3553585Z ) -> None: 2025-05-07T20:32:12.3553814Z torch.manual_seed(2025) 2025-05-07T20:32:12.3554068Z 2025-05-07T20:32:12.3554352Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:12.3554695Z 2025-05-07T20:32:12.3554900Z x_sign = torch.sign(x) 
2025-05-07T20:32:12.3555200Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:12.3555510Z x = x_sign * x_clamp 2025-05-07T20:32:12.3555761Z x0 = x[:, :D] 2025-05-07T20:32:12.3555985Z x1 = x[:, D:] 2025-05-07T20:32:12.3556195Z 2025-05-07T20:32:12.3556392Z if contiguous: 2025-05-07T20:32:12.3556632Z x0 = x0.contiguous() 2025-05-07T20:32:12.3556897Z x1 = x1.contiguous() 2025-05-07T20:32:12.3557147Z 2025-05-07T20:32:12.3557344Z if scale_ub is not None: 2025-05-07T20:32:12.3557632Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:12.3557977Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:12.3558285Z ) 2025-05-07T20:32:12.3558519Z else: 2025-05-07T20:32:12.3558758Z scale_ub_tensor = None 2025-05-07T20:32:12.3559011Z 2025-05-07T20:32:12.3559253Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:12.3559577Z op = silu_mul_quant 2025-05-07T20:32:12.3559828Z if compiled: 2025-05-07T20:32:12.3560087Z op = torch.compile(op) 2025-05-07T20:32:12.3560393Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:12.3560672Z 2025-05-07T20:32:12.3560874Z > y_fp8, y_scale = fn() 2025-05-07T20:32:12.3561047Z 2025-05-07T20:32:12.3561154Z moe/activation_test.py:117: 2025-05-07T20:32:12.3561463Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:12.3561797Z moe/activation_test.py:115: in fn 2025-05-07T20:32:12.3562085Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:12.3562943Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:32:12.3563623Z _fbgemm_silu_mul_quant[grid]( 2025-05-07T20:32:12.3564162Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:12.3564840Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:12.3565502Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:12.3566034Z kernel = self.compile( 2025-05-07T20:32:12.3566661Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:12.3567320Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:12.3567716Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:12.3567957Z 2025-05-07T20:32:12.3568164Z self = 2025-05-07T20:32:12.3569281Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:12.3570649Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . 
at 0x7fce5b278680>} 2025-05-07T20:32:12.3571988Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:12.3572993Z context = 2025-05-07T20:32:12.3573286Z 2025-05-07T20:32:12.3573450Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:12.3574083Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:12.3574553Z module_map=module_map) 2025-05-07T20:32:12.3574914Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:12.3575269Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:32:12.3575538Z E ^ 2025-05-07T20:32:12.3575994Z E ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:12.3576442Z 2025-05-07T20:32:12.3576854Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:12.3577373Z 2025-05-07T20:32:12.3577479Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:12.3577893Z self=, 2025-05-07T20:32:12.3578298Z T=4096, 2025-05-07T20:32:12.3578493Z D=5120, 2025-05-07T20:32:12.3578691Z scale_ub=1200.0, 2025-05-07T20:32:12.3578915Z contiguous=True, 2025-05-07T20:32:12.3579143Z compiled=False, 2025-05-07T20:32:12.3579356Z ) 2025-05-07T20:32:12.3579673Z self = 2025-05-07T20:32:12.3580170Z T = 4096, D = 5120, scale_ub = 1200.0, contiguous = True, compiled = False 2025-05-07T20:32:12.3580451Z 2025-05-07T20:32:12.3580534Z @given( 2025-05-07T20:32:12.3580768Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:12.3581080Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:12.3581393Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:12.3581734Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:12.3582062Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:12.3582352Z ) 2025-05-07T20:32:12.3582707Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:12.3583240Z def test_silu_mul_quant( 2025-05-07T20:32:12.3583485Z self, 2025-05-07T20:32:12.3583688Z T: int, 2025-05-07T20:32:12.3583886Z D: int, 2025-05-07T20:32:12.3584111Z scale_ub: Optional[float], 2025-05-07T20:32:12.3584389Z contiguous: bool, 2025-05-07T20:32:12.3584626Z compiled: bool, 2025-05-07T20:32:12.3584853Z ) -> None: 2025-05-07T20:32:12.3585069Z torch.manual_seed(2025) 2025-05-07T20:32:12.3585313Z 2025-05-07T20:32:12.3585584Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:12.3585930Z 2025-05-07T20:32:12.3586133Z x_sign = torch.sign(x) 2025-05-07T20:32:12.3586569Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:12.3586883Z x = x_sign * x_clamp 2025-05-07T20:32:12.3587126Z x0 = x[:, :D] 2025-05-07T20:32:12.3587339Z x1 = x[:, D:] 2025-05-07T20:32:12.3587553Z 2025-05-07T20:32:12.3587750Z if contiguous: 2025-05-07T20:32:12.3587978Z x0 = x0.contiguous() 2025-05-07T20:32:12.3588242Z x1 = x1.contiguous() 2025-05-07T20:32:12.3588492Z 2025-05-07T20:32:12.3588683Z if scale_ub is not None: 2025-05-07T20:32:12.3588961Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:12.3589295Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:12.3589605Z ) 2025-05-07T20:32:12.3589803Z else: 2025-05-07T20:32:12.3590016Z scale_ub_tensor = None 2025-05-07T20:32:12.3590272Z 2025-05-07T20:32:12.3590501Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:12.3590826Z op = silu_mul_quant 2025-05-07T20:32:12.3591085Z if compiled: 
2025-05-07T20:32:12.3591330Z op = torch.compile(op) 2025-05-07T20:32:12.3591633Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:12.3591913Z 2025-05-07T20:32:12.3592105Z > y_fp8, y_scale = fn() 2025-05-07T20:32:12.3592285Z 2025-05-07T20:32:12.3592385Z moe/activation_test.py:117: 2025-05-07T20:32:12.3592686Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:12.3593018Z moe/activation_test.py:115: in fn 2025-05-07T20:32:12.3593309Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:12.3594002Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:32:12.3594695Z _fbgemm_silu_mul_quant[grid]( 2025-05-07T20:32:12.3595230Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:12.3595922Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:12.3596588Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:12.3597119Z kernel = self.compile( 2025-05-07T20:32:12.3597667Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:12.3598622Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:12.3599030Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:12.3599258Z 2025-05-07T20:32:12.3599466Z self = 2025-05-07T20:32:12.3600537Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:12.3601907Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7fce5b278f40>} 2025-05-07T20:32:12.3603243Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:12.3604387Z context = 2025-05-07T20:32:12.3604670Z 2025-05-07T20:32:12.3604834Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:12.3605353Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:12.3605820Z module_map=module_map) 2025-05-07T20:32:12.3606184Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:12.3606541Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:32:12.3606932Z E ^ 2025-05-07T20:32:12.3607395Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:12.3607837Z 2025-05-07T20:32:12.3608251Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:12.3608821Z 2025-05-07T20:32:12.3608925Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:12.3609343Z self=, 2025-05-07T20:32:12.3609747Z T=1, 2025-05-07T20:32:12.3609931Z D=5120, 2025-05-07T20:32:12.3610132Z scale_ub=None, 2025-05-07T20:32:12.3610351Z contiguous=True, 2025-05-07T20:32:12.3610572Z compiled=True, 2025-05-07T20:32:12.3610783Z ) 2025-05-07T20:32:12.3611110Z self = 2025-05-07T20:32:12.3611587Z T = 1, D = 5120, scale_ub = None, contiguous = True, compiled = True 2025-05-07T20:32:12.3611855Z 2025-05-07T20:32:12.3611936Z @given( 2025-05-07T20:32:12.3612174Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:12.3612485Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:12.3612800Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:12.3613134Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:12.3613462Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:12.3613854Z ) 2025-05-07T20:32:12.3614210Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:12.3614658Z def test_silu_mul_quant( 2025-05-07T20:32:12.3614900Z self, 2025-05-07T20:32:12.3615104Z T: int, 2025-05-07T20:32:12.3615307Z D: int, 2025-05-07T20:32:12.3615524Z scale_ub: Optional[float], 2025-05-07T20:32:12.3615800Z contiguous: bool, 2025-05-07T20:32:12.3616055Z compiled: bool, 2025-05-07T20:32:12.3616283Z ) -> None: 2025-05-07T20:32:12.3616508Z torch.manual_seed(2025) 2025-05-07T20:32:12.3616748Z 2025-05-07T20:32:12.3617030Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:12.3617383Z 2025-05-07T20:32:12.3617581Z x_sign = torch.sign(x) 2025-05-07T20:32:12.3617877Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:12.3618191Z x = x_sign * x_clamp 2025-05-07T20:32:12.3618464Z x0 = x[:, :D] 2025-05-07T20:32:12.3618705Z x1 = x[:, D:] 2025-05-07T20:32:12.3618921Z 2025-05-07T20:32:12.3619104Z if contiguous: 2025-05-07T20:32:12.3619343Z x0 = x0.contiguous() 2025-05-07T20:32:12.3619605Z x1 = x1.contiguous() 2025-05-07T20:32:12.3619844Z 2025-05-07T20:32:12.3620047Z if scale_ub is not None: 2025-05-07T20:32:12.3620325Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:12.3620659Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:12.3620974Z ) 2025-05-07T20:32:12.3621174Z else: 2025-05-07T20:32:12.3621390Z scale_ub_tensor = None 2025-05-07T20:32:12.3621639Z 2025-05-07T20:32:12.3621877Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:12.3622300Z op = silu_mul_quant 2025-05-07T20:32:12.3622548Z if compiled: 2025-05-07T20:32:12.3622804Z op = torch.compile(op) 2025-05-07T20:32:12.3623106Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:12.3623381Z 2025-05-07T20:32:12.3623586Z y_fp8, y_scale = fn() 2025-05-07T20:32:12.3623873Z y = y_fp8.to(torch.float32) * y_scale[:, None] 2025-05-07T20:32:12.3624163Z 2025-05-07T20:32:12.3624407Z def ref_fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:12.3624749Z x0_fp32 = x0.to(torch.float32) 2025-05-07T20:32:12.3625041Z x1_fp32 = x1.to(torch.float32) 2025-05-07T20:32:12.3625357Z y = x0_fp32 * torch.sigmoid(x0_fp32) * x1_fp32 2025-05-07T20:32:12.3625799Z return triton_quantize_fp8_row(y, scale_ub_tensor) 2025-05-07T20:32:12.3626119Z 2025-05-07T20:32:12.3626320Z > y_fp8_ref, 
y_scale_ref = ref_fn() 2025-05-07T20:32:12.3626522Z 2025-05-07T20:32:12.3626630Z moe/activation_test.py:126: 2025-05-07T20:32:12.3626931Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:12.3627266Z moe/activation_test.py:124: in ref_fn 2025-05-07T20:32:12.3627599Z return triton_quantize_fp8_row(y, scale_ub_tensor) 2025-05-07T20:32:12.3628383Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/fbgemm_gpu/experimental/gemm/triton_gemm/fp8_gemm.py:2370: in triton_quantize_fp8_row 2025-05-07T20:32:12.3629132Z _kernel_quantize_fp8_row[grid]( 2025-05-07T20:32:12.3629674Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:12.3630360Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:12.3631052Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/autotuner.py:186: in run 2025-05-07T20:32:12.3631768Z timings = {config: self._bench(*args, config=config, **kwargs) for config in pruned_configs} 2025-05-07T20:32:12.3632497Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/autotuner.py:166: in _bench 2025-05-07T20:32:12.3633137Z return self.do_bench(kernel_call, quantiles=(0.5, 0.2, 0.8)) 2025-05-07T20:32:12.3633744Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/testing.py:117: in do_bench 2025-05-07T20:32:12.3634258Z fn() 2025-05-07T20:32:12.3634766Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/autotuner.py:152: in kernel_call 2025-05-07T20:32:12.3635345Z self.fn.run( 2025-05-07T20:32:12.3635812Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:12.3636340Z kernel = self.compile( 2025-05-07T20:32:12.3636882Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:12.3637534Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:12.3637932Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:12.3638170Z 2025-05-07T20:32:12.3638383Z self = 2025-05-07T20:32:12.3639502Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=2, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:12.3640864Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7fce60117f60>} 2025-05-07T20:32:12.3642195Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:12.3643303Z context = 2025-05-07T20:32:12.3643593Z 2025-05-07T20:32:12.3643760Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:12.3644279Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:12.3644743Z module_map=module_map) 2025-05-07T20:32:12.3645110Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:12.3645472Z E def _kernel_quantize_fp8_row( 2025-05-07T20:32:12.3645740Z E ^ 2025-05-07T20:32:12.3646278Z E ValueError("type fp8e4nv not supported in this architecture. 
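The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')")

Every failure above has the same root cause: Triton only lowers the fp8e4nv (torch.float8_e4m3fn) dtype on NVIDIA GPUs with compute capability 8.9 or newer (Ada/Hopper); on older architectures only fp8e5 and fp8e4b15 are available, which is exactly what the ValueError reports. A minimal sketch, assuming a hypothetical guard that is not part of the test file, of how these examples could be skipped on unsupported hardware:

import unittest

import torch

def supports_fp8e4nv() -> bool:
    # Triton lowers fp8e4nv (torch.float8_e4m3fn) only on compute
    # capability >= 8.9 (Ada, Hopper); older GPUs raise the ValueError
    # seen throughout this log.
    return torch.cuda.is_available() and torch.cuda.get_device_capability() >= (8, 9)

class Fp8GuardedTest(unittest.TestCase):  # hypothetical class name
    def setUp(self) -> None:
        if not supports_fp8e4nv():
            self.skipTest("fp8e4nv not supported on this GPU architecture")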
2025-05-07T20:32:12.3646725Z 2025-05-07T20:32:12.3647136Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:100: CompilationError
2025-05-07T20:32:12.5812496Z W0507 20:32:12.577000 276022 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/4] Encountered an exception in identify_mutated_tensors, assuming every input is mutated
W0507 20:32:12.577000 276022 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/4] [traceback identical to the [0/3] warning above, ending in:]
W0507 20:32:12.577000 276022 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/4] triton.compiler.errors.CompilationError: at 1:0:
W0507 20:32:12.577000 276022 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/4] def _fbgemm_silu_mul_quant(
W0507 20:32:12.577000 276022 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/4] ^
W0507 20:32:12.577000 276022 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/4] ValueError("type fp8e4nv not supported in this architecture.
2025-05-07T20:32:12.6437461Z W0507 20:32:12.640000 276022 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/4] Encountered an exception in identify_mutated_tensors, assuming every input is mutated
W0507 20:32:12.640000 276022 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/4] [traceback identical to the previous warning, ending in:]
W0507 20:32:12.640000 276022 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/4] ValueError("type fp8e4nv not supported in this architecture.
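The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')")

The full test source is reprinted for every attempt because the suite runs with @settings(verbosity=Verbosity.verbose, ...): at that verbosity Hypothesis echoes each example it draws from the sampled_from strategies as a "Trying example:" entry. A standalone sketch of the same pattern (the max_examples value is illustrative, standing in for the suite's _MAX_SAMPLES):

from hypothesis import Verbosity, given, settings, strategies as st

@given(
    T=st.sampled_from([1, 128, 2048, 4096, 16384]),
    scale_ub=st.sampled_from([None, 1200.00]),
)
@settings(verbosity=Verbosity.verbose, max_examples=10, deadline=None)
def test_grid(T: int, scale_ub) -> None:
    # Verbosity.verbose logs one "Trying example: ..." entry per draw,
    # which is what fills this log with repeated test bodies.
    assert T > 0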
2025-05-07T20:32:13.1288281Z W0507 20:32:13.125000 276022 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/5] Encountered an exception in identify_mutated_tensors, assuming every input is mutated
2025-05-07T20:32:13.1906662Z W0507 20:32:13.186000 276022 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/5] Encountered an exception in identify_mutated_tensors, assuming every input is mutated
W0507 20:32:13.186000 276022 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/5] [both tracebacks identical to the previous warnings, each ending in:]
W0507 20:32:13.186000 276022 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/5] ValueError("type fp8e4nv not supported in this architecture.
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:13.4617966Z 2025-05-07T20:32:13.4618199Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:13.4618683Z self=, 2025-05-07T20:32:13.4619275Z T=2048, 2025-05-07T20:32:13.4619485Z D=5120, 2025-05-07T20:32:13.4619690Z scale_ub=None, 2025-05-07T20:32:13.4619905Z contiguous=True, 2025-05-07T20:32:13.4620137Z compiled=True, 2025-05-07T20:32:13.4620358Z ) 2025-05-07T20:32:13.4620681Z self = 2025-05-07T20:32:13.4621500Z T = 2048, D = 5120, scale_ub = None, contiguous = True, compiled = True 2025-05-07T20:32:13.4621777Z 2025-05-07T20:32:13.4621866Z @given( 2025-05-07T20:32:13.4622108Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:13.4622431Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:13.4622744Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:13.4623079Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:13.4623406Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:13.4623701Z ) 2025-05-07T20:32:13.4624054Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:13.4624497Z def test_silu_mul_quant( 2025-05-07T20:32:13.4624754Z self, 2025-05-07T20:32:13.4624956Z T: int, 2025-05-07T20:32:13.4625153Z D: int, 2025-05-07T20:32:13.4625377Z scale_ub: Optional[float], 2025-05-07T20:32:13.4625652Z contiguous: bool, 2025-05-07T20:32:13.4625896Z compiled: bool, 2025-05-07T20:32:13.4626130Z ) -> None: 2025-05-07T20:32:13.4626352Z torch.manual_seed(2025) 2025-05-07T20:32:13.4626590Z 2025-05-07T20:32:13.4626863Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:13.4627215Z 2025-05-07T20:32:13.4627410Z x_sign = torch.sign(x) 2025-05-07T20:32:13.4627705Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:13.4628018Z x = x_sign * x_clamp 2025-05-07T20:32:13.4628264Z x0 = x[:, :D] 2025-05-07T20:32:13.4628480Z x1 = x[:, D:] 2025-05-07T20:32:13.4628695Z 2025-05-07T20:32:13.4628912Z if contiguous: 2025-05-07T20:32:13.4629165Z x0 = x0.contiguous() 2025-05-07T20:32:13.4629426Z x1 = x1.contiguous() 2025-05-07T20:32:13.4629671Z 2025-05-07T20:32:13.4629859Z if scale_ub is not None: 2025-05-07T20:32:13.4630134Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:13.4630477Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:13.4630779Z ) 2025-05-07T20:32:13.4630979Z else: 2025-05-07T20:32:13.4631195Z scale_ub_tensor = None 2025-05-07T20:32:13.4631441Z 2025-05-07T20:32:13.4631679Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:13.4631999Z op = silu_mul_quant 2025-05-07T20:32:13.4632246Z if compiled: 2025-05-07T20:32:13.4632496Z op = torch.compile(op) 2025-05-07T20:32:13.4632795Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:13.4633074Z 2025-05-07T20:32:13.4633261Z y_fp8, y_scale = fn() 2025-05-07T20:32:13.4633548Z y = y_fp8.to(torch.float32) * y_scale[:, None] 2025-05-07T20:32:13.4633844Z 2025-05-07T20:32:13.4634077Z def ref_fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:13.4634415Z x0_fp32 = x0.to(torch.float32) 2025-05-07T20:32:13.4634711Z x1_fp32 = x1.to(torch.float32) 2025-05-07T20:32:13.4635026Z y = x0_fp32 * torch.sigmoid(x0_fp32) * x1_fp32 2025-05-07T20:32:13.4635398Z return triton_quantize_fp8_row(y, scale_ub_tensor) 2025-05-07T20:32:13.4635710Z 2025-05-07T20:32:13.4635911Z > y_fp8_ref, y_scale_ref = ref_fn() 2025-05-07T20:32:13.4636292Z 2025-05-07T20:32:13.4636393Z moe/activation_test.py:126: 2025-05-07T20:32:13.4636694Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:13.4637040Z moe/activation_test.py:124: in ref_fn 2025-05-07T20:32:13.4637362Z return triton_quantize_fp8_row(y, scale_ub_tensor) 2025-05-07T20:32:13.4638151Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/fbgemm_gpu/experimental/gemm/triton_gemm/fp8_gemm.py:2370: in triton_quantize_fp8_row 2025-05-07T20:32:13.4638903Z _kernel_quantize_fp8_row[grid]( 2025-05-07T20:32:13.4639442Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:13.4640204Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:13.4642245Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/autotuner.py:186: in run 2025-05-07T20:32:13.4642975Z timings = {config: self._bench(*args, config=config, **kwargs) for config in pruned_configs} 2025-05-07T20:32:13.4643686Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/autotuner.py:166: in _bench 2025-05-07T20:32:13.4644324Z return self.do_bench(kernel_call, quantiles=(0.5, 0.2, 0.8)) 2025-05-07T20:32:13.4644921Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/testing.py:117: in do_bench 2025-05-07T20:32:13.4645441Z fn() 2025-05-07T20:32:13.4645937Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/autotuner.py:152: in kernel_call 2025-05-07T20:32:13.4646517Z self.fn.run( 2025-05-07T20:32:13.4646991Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:13.4647512Z kernel = self.compile( 2025-05-07T20:32:13.4648052Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:13.4648714Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:13.4649116Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:13.4649343Z 2025-05-07T20:32:13.4649551Z self = 2025-05-07T20:32:13.4650630Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=2, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:13.4652001Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7fce5ae28cc0>} 2025-05-07T20:32:13.4653329Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:13.4654497Z context = 2025-05-07T20:32:13.4654787Z 2025-05-07T20:32:13.4654952Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:13.4655469Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:13.4655933Z module_map=module_map) 2025-05-07T20:32:13.4656297Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:13.4656654Z E def _kernel_quantize_fp8_row( 2025-05-07T20:32:13.4656926Z E ^ 2025-05-07T20:32:13.4657387Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:13.4657833Z 2025-05-07T20:32:13.4658244Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:13.4658855Z 2025-05-07T20:32:13.4658958Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:13.4659368Z self=, 2025-05-07T20:32:13.4659761Z T=128, 2025-05-07T20:32:13.4659951Z D=5120, 2025-05-07T20:32:13.4660143Z scale_ub=None, 2025-05-07T20:32:13.4660357Z contiguous=True, 2025-05-07T20:32:13.4660584Z compiled=True, 2025-05-07T20:32:13.4660789Z ) 2025-05-07T20:32:13.4661100Z self = 2025-05-07T20:32:13.4661588Z T = 128, D = 5120, scale_ub = None, contiguous = True, compiled = True 2025-05-07T20:32:13.4661856Z 2025-05-07T20:32:13.4661939Z @given( 2025-05-07T20:32:13.4662253Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:13.4662563Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:13.4662871Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:13.4663210Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:13.4663533Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:13.4663821Z ) 2025-05-07T20:32:13.4664171Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:13.4664617Z def test_silu_mul_quant( 2025-05-07T20:32:13.4664861Z self, 2025-05-07T20:32:13.4665060Z T: int, 2025-05-07T20:32:13.4665263Z D: int, 2025-05-07T20:32:13.4665478Z scale_ub: Optional[float], 2025-05-07T20:32:13.4665754Z contiguous: bool, 2025-05-07T20:32:13.4665995Z compiled: bool, 2025-05-07T20:32:13.4666216Z ) -> None: 2025-05-07T20:32:13.4666439Z torch.manual_seed(2025) 2025-05-07T20:32:13.4666714Z 2025-05-07T20:32:13.4666997Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:13.4667336Z 2025-05-07T20:32:13.4667534Z x_sign = torch.sign(x) 2025-05-07T20:32:13.4667826Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:13.4668136Z x = x_sign * x_clamp 2025-05-07T20:32:13.4668373Z x0 = x[:, :D] 2025-05-07T20:32:13.4668593Z x1 = x[:, D:] 2025-05-07T20:32:13.4668806Z 2025-05-07T20:32:13.4668994Z if contiguous: 2025-05-07T20:32:13.4669225Z x0 = x0.contiguous() 2025-05-07T20:32:13.4669480Z x1 = x1.contiguous() 2025-05-07T20:32:13.4669722Z 2025-05-07T20:32:13.4669919Z if scale_ub is not None: 2025-05-07T20:32:13.4670186Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:13.4670519Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:13.4670835Z ) 2025-05-07T20:32:13.4671031Z else: 2025-05-07T20:32:13.4671243Z scale_ub_tensor = None 2025-05-07T20:32:13.4671499Z 2025-05-07T20:32:13.4671732Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:13.4672041Z op = silu_mul_quant 2025-05-07T20:32:13.4672300Z if compiled: 2025-05-07T20:32:13.4672547Z op = torch.compile(op) 2025-05-07T20:32:13.4672839Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:13.4673117Z 2025-05-07T20:32:13.4673311Z y_fp8, y_scale = fn() 2025-05-07T20:32:13.4673589Z y = y_fp8.to(torch.float32) * y_scale[:, None] 2025-05-07T20:32:13.4673883Z 2025-05-07T20:32:13.4674125Z def ref_fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:13.4674454Z x0_fp32 = x0.to(torch.float32) 2025-05-07T20:32:13.4674743Z x1_fp32 = x1.to(torch.float32) 2025-05-07T20:32:13.4675058Z y = x0_fp32 * torch.sigmoid(x0_fp32) * x1_fp32 2025-05-07T20:32:13.4675424Z return triton_quantize_fp8_row(y, scale_ub_tensor) 2025-05-07T20:32:13.4675726Z 2025-05-07T20:32:13.4675932Z > y_fp8_ref, 
y_scale_ref = ref_fn() 2025-05-07T20:32:13.4676122Z 2025-05-07T20:32:13.4676230Z moe/activation_test.py:126: 2025-05-07T20:32:13.4676521Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:13.4677010Z moe/activation_test.py:124: in ref_fn 2025-05-07T20:32:13.4677331Z return triton_quantize_fp8_row(y, scale_ub_tensor) 2025-05-07T20:32:13.4678109Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/fbgemm_gpu/experimental/gemm/triton_gemm/fp8_gemm.py:2370: in triton_quantize_fp8_row 2025-05-07T20:32:13.4678853Z _kernel_quantize_fp8_row[grid]( 2025-05-07T20:32:13.4679390Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:13.4680067Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:13.4680831Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/autotuner.py:186: in run 2025-05-07T20:32:13.4681552Z timings = {config: self._bench(*args, config=config, **kwargs) for config in pruned_configs} 2025-05-07T20:32:13.4682267Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/autotuner.py:166: in _bench 2025-05-07T20:32:13.4682901Z return self.do_bench(kernel_call, quantiles=(0.5, 0.2, 0.8)) 2025-05-07T20:32:13.4683503Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/testing.py:117: in do_bench 2025-05-07T20:32:13.4684022Z fn() 2025-05-07T20:32:13.4684520Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/autotuner.py:152: in kernel_call 2025-05-07T20:32:13.4685096Z self.fn.run( 2025-05-07T20:32:13.4685561Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:13.4686079Z kernel = self.compile( 2025-05-07T20:32:13.4686625Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:13.4687274Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:13.4687675Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:13.4687899Z 2025-05-07T20:32:13.4688104Z self = 2025-05-07T20:32:13.4689219Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=2, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:13.4690570Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7fce5ae48f40>} 2025-05-07T20:32:13.4691900Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:13.4692915Z context = 2025-05-07T20:32:13.4693203Z 2025-05-07T20:32:13.4693367Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:13.4693970Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:13.4694435Z module_map=module_map) 2025-05-07T20:32:13.4694792Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:13.4695149Z E def _kernel_quantize_fp8_row( 2025-05-07T20:32:13.4695419Z E ^ 2025-05-07T20:32:13.4695879Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:13.4696321Z 2025-05-07T20:32:13.4696736Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:13.6940964Z W0507 20:32:13.690000 276022 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/6] Encountered an exception in identify_mutated_tensors, assuming every input is mutated 2025-05-07T20:32:13.6942436Z W0507 20:32:13.690000 276022 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/6] Traceback (most recent call last): 2025-05-07T20:32:13.6943750Z W0507 20:32:13.690000 276022 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/6] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/torch/_higher_order_ops/triton_kernel_wrap.py", line 731, in identify_mutated_tensors 2025-05-07T20:32:13.6945155Z W0507 20:32:13.690000 276022 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/6] ttir_module, ordered_tensor_names = generate_ttir(kernel, kwargs) 2025-05-07T20:32:13.6946260Z W0507 20:32:13.690000 276022 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/6] ~~~~~~~~~~~~~^^^^^^^^^^^^^^^^ 2025-05-07T20:32:13.6947542Z W0507 20:32:13.690000 276022 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/6] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/torch/_higher_order_ops/triton_kernel_wrap.py", line 356, in generate_ttir 2025-05-07T20:32:13.6948902Z W0507 20:32:13.690000 276022 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/6] ttir_module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:13.6950180Z W0507 20:32:13.690000 276022 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/6] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py", line 100, in make_ir 2025-05-07T20:32:13.6951533Z W0507 20:32:13.690000 276022 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/6] return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:13.6952558Z W0507 20:32:13.690000 276022 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/6] module_map=module_map) 2025-05-07T20:32:13.6953808Z W0507 20:32:13.690000 276022 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/6] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/code_generator.py", line 1298, in ast_to_ttir 2025-05-07T20:32:13.6955034Z W0507 20:32:13.690000 276022 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/6] generator.visit(fn.parse()) 2025-05-07T20:32:13.6955868Z W0507 20:32:13.690000 276022 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/6] ~~~~~~~~~~~~~~~^^^^^^^^^^^^ 2025-05-07T20:32:13.6957054Z W0507 20:32:13.690000 276022 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/6] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/code_generator.py", line 1201, in visit 2025-05-07T20:32:13.6958238Z W0507 20:32:13.690000 276022 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/6] ret = super().visit(node) 2025-05-07T20:32:13.6959260Z W0507 20:32:13.690000 276022 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/6] File 
"/home/ec2-user/miniconda/envs/build_binary/lib/python3.13/ast.py", line 428, in visit 2025-05-07T20:32:13.6960266Z W0507 20:32:13.690000 276022 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/6] return visitor(node) 2025-05-07T20:32:13.6961464Z W0507 20:32:13.690000 276022 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/6] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/code_generator.py", line 352, in visit_Module 2025-05-07T20:32:13.6962732Z W0507 20:32:13.690000 276022 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/6] ast.NodeVisitor.generic_visit(self, node) 2025-05-07T20:32:13.6963618Z W0507 20:32:13.690000 276022 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/6] ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~^^^^^^^^^^^^ 2025-05-07T20:32:13.6964782Z W0507 20:32:13.690000 276022 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/6] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.13/ast.py", line 436, in generic_visit 2025-05-07T20:32:13.6965805Z W0507 20:32:13.690000 276022 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/6] self.visit(item) 2025-05-07T20:32:13.6966565Z W0507 20:32:13.690000 276022 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/6] ~~~~~~~~~~^^^^^^ 2025-05-07T20:32:13.6967780Z W0507 20:32:13.690000 276022 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/6] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/code_generator.py", line 1207, in visit 2025-05-07T20:32:13.6969162Z W0507 20:32:13.690000 276022 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/6] raise CompilationError(self.jit_fn.src, self.cur_node, repr(e)) from None 2025-05-07T20:32:13.6970213Z W0507 20:32:13.690000 276022 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/6] triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:13.6971115Z W0507 20:32:13.690000 276022 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/6] def _fbgemm_silu_mul_quant( 2025-05-07T20:32:13.6971852Z W0507 20:32:13.690000 276022 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/6] ^ 2025-05-07T20:32:13.6972849Z W0507 20:32:13.690000 276022 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/6] ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:13.7567623Z W0507 20:32:13.753000 276022 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/6] Encountered an exception in identify_mutated_tensors, assuming every input is mutated 2025-05-07T20:32:13.7569479Z W0507 20:32:13.753000 276022 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/6] Traceback (most recent call last): 2025-05-07T20:32:13.7570826Z W0507 20:32:13.753000 276022 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/6] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/torch/_higher_order_ops/triton_kernel_wrap.py", line 731, in identify_mutated_tensors 2025-05-07T20:32:13.7572261Z W0507 20:32:13.753000 276022 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/6] ttir_module, ordered_tensor_names = generate_ttir(kernel, kwargs) 2025-05-07T20:32:13.7573247Z W0507 20:32:13.753000 276022 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/6] ~~~~~~~~~~~~~^^^^^^^^^^^^^^^^ 2025-05-07T20:32:13.7574691Z W0507 20:32:13.753000 276022 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/6] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/torch/_higher_order_ops/triton_kernel_wrap.py", line 356, in generate_ttir 2025-05-07T20:32:13.7576066Z W0507 20:32:13.753000 276022 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/6] ttir_module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:13.7577356Z W0507 20:32:13.753000 276022 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/6] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py", line 100, in make_ir 2025-05-07T20:32:13.7578714Z W0507 20:32:13.753000 276022 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/6] return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:13.7579753Z W0507 20:32:13.753000 276022 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/6] module_map=module_map) 2025-05-07T20:32:13.7580992Z W0507 20:32:13.753000 276022 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/6] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/code_generator.py", line 1298, in ast_to_ttir 2025-05-07T20:32:13.7582554Z W0507 20:32:13.753000 276022 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/6] generator.visit(fn.parse()) 2025-05-07T20:32:13.7583398Z W0507 20:32:13.753000 276022 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/6] ~~~~~~~~~~~~~~~^^^^^^^^^^^^ 2025-05-07T20:32:13.7584587Z W0507 20:32:13.753000 276022 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/6] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/code_generator.py", line 1201, in visit 2025-05-07T20:32:13.7585950Z W0507 20:32:13.753000 276022 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/6] ret = super().visit(node) 2025-05-07T20:32:13.7586971Z W0507 20:32:13.753000 276022 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/6] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.13/ast.py", line 428, in visit 2025-05-07T20:32:13.7587989Z W0507 20:32:13.753000 276022 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/6] return 
visitor(node) 2025-05-07T20:32:13.7589198Z W0507 20:32:13.753000 276022 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/6] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/code_generator.py", line 352, in visit_Module 2025-05-07T20:32:13.7590464Z W0507 20:32:13.753000 276022 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/6] ast.NodeVisitor.generic_visit(self, node) 2025-05-07T20:32:13.7591365Z W0507 20:32:13.753000 276022 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/6] ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~^^^^^^^^^^^^ 2025-05-07T20:32:13.7592441Z W0507 20:32:13.753000 276022 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/6] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.13/ast.py", line 436, in generic_visit 2025-05-07T20:32:13.7593478Z W0507 20:32:13.753000 276022 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/6] self.visit(item) 2025-05-07T20:32:13.7594245Z W0507 20:32:13.753000 276022 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/6] ~~~~~~~~~~^^^^^^ 2025-05-07T20:32:13.7595400Z W0507 20:32:13.753000 276022 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/6] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/code_generator.py", line 1207, in visit 2025-05-07T20:32:13.7596738Z W0507 20:32:13.753000 276022 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/6] raise CompilationError(self.jit_fn.src, self.cur_node, repr(e)) from None 2025-05-07T20:32:13.7597791Z W0507 20:32:13.753000 276022 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/6] triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:13.7599113Z W0507 20:32:13.753000 276022 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/6] def _fbgemm_silu_mul_quant( 2025-05-07T20:32:13.7599856Z W0507 20:32:13.753000 276022 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/6] ^ 2025-05-07T20:32:13.7600853Z W0507 20:32:13.753000 276022 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/6] ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:14.2972082Z W0507 20:32:14.293000 276022 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/7] Encountered an exception in identify_mutated_tensors, assuming every input is mutated 2025-05-07T20:32:14.2973263Z W0507 20:32:14.293000 276022 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/7] Traceback (most recent call last): 2025-05-07T20:32:14.2974703Z W0507 20:32:14.293000 276022 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/7] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/torch/_higher_order_ops/triton_kernel_wrap.py", line 731, in identify_mutated_tensors 2025-05-07T20:32:14.2976591Z W0507 20:32:14.293000 276022 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/7] ttir_module, ordered_tensor_names = generate_ttir(kernel, kwargs) 2025-05-07T20:32:14.2977556Z W0507 20:32:14.293000 276022 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/7] ~~~~~~~~~~~~~^^^^^^^^^^^^^^^^ 2025-05-07T20:32:14.2979007Z W0507 20:32:14.293000 276022 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/7] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/torch/_higher_order_ops/triton_kernel_wrap.py", line 356, in generate_ttir 2025-05-07T20:32:14.2980378Z W0507 20:32:14.293000 276022 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/7] ttir_module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:14.2981674Z W0507 20:32:14.293000 276022 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/7] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py", line 100, in make_ir 2025-05-07T20:32:14.2983031Z W0507 20:32:14.293000 276022 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/7] return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:14.2984058Z W0507 20:32:14.293000 276022 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/7] module_map=module_map) 2025-05-07T20:32:14.2985308Z W0507 20:32:14.293000 276022 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/7] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/code_generator.py", line 1298, in ast_to_ttir 2025-05-07T20:32:14.2986540Z W0507 20:32:14.293000 276022 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/7] generator.visit(fn.parse()) 2025-05-07T20:32:14.2987382Z W0507 20:32:14.293000 276022 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/7] ~~~~~~~~~~~~~~~^^^^^^^^^^^^ 2025-05-07T20:32:14.2988569Z W0507 20:32:14.293000 276022 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/7] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/code_generator.py", line 1201, in visit 2025-05-07T20:32:14.2989754Z W0507 20:32:14.293000 276022 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/7] ret = super().visit(node) 2025-05-07T20:32:14.2990785Z W0507 20:32:14.293000 276022 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/7] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.13/ast.py", line 428, in visit 2025-05-07T20:32:14.2991796Z W0507 20:32:14.293000 276022 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/7] return 
visitor(node) 2025-05-07T20:32:14.2993011Z W0507 20:32:14.293000 276022 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/7] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/code_generator.py", line 352, in visit_Module 2025-05-07T20:32:14.2994275Z W0507 20:32:14.293000 276022 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/7] ast.NodeVisitor.generic_visit(self, node) 2025-05-07T20:32:14.2995164Z W0507 20:32:14.293000 276022 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/7] ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~^^^^^^^^^^^^ 2025-05-07T20:32:14.2996246Z W0507 20:32:14.293000 276022 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/7] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.13/ast.py", line 436, in generic_visit 2025-05-07T20:32:14.2997279Z W0507 20:32:14.293000 276022 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/7] self.visit(item) 2025-05-07T20:32:14.2998123Z W0507 20:32:14.293000 276022 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/7] ~~~~~~~~~~^^^^^^ 2025-05-07T20:32:14.2999667Z W0507 20:32:14.293000 276022 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/7] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/code_generator.py", line 1207, in visit 2025-05-07T20:32:14.3001008Z W0507 20:32:14.293000 276022 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/7] raise CompilationError(self.jit_fn.src, self.cur_node, repr(e)) from None 2025-05-07T20:32:14.3002058Z W0507 20:32:14.293000 276022 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/7] triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:14.3003083Z W0507 20:32:14.293000 276022 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/7] def _fbgemm_silu_mul_quant( 2025-05-07T20:32:14.3003826Z W0507 20:32:14.293000 276022 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/7] ^ 2025-05-07T20:32:14.3004838Z W0507 20:32:14.293000 276022 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/7] ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:14.3596960Z W0507 20:32:14.356000 276022 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/7] Encountered an exception in identify_mutated_tensors, assuming every input is mutated 2025-05-07T20:32:14.3598464Z W0507 20:32:14.356000 276022 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/7] Traceback (most recent call last): 2025-05-07T20:32:14.3599867Z W0507 20:32:14.356000 276022 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/7] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/torch/_higher_order_ops/triton_kernel_wrap.py", line 731, in identify_mutated_tensors 2025-05-07T20:32:14.3601288Z W0507 20:32:14.356000 276022 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/7] ttir_module, ordered_tensor_names = generate_ttir(kernel, kwargs) 2025-05-07T20:32:14.3602271Z W0507 20:32:14.356000 276022 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/7] ~~~~~~~~~~~~~^^^^^^^^^^^^^^^^ 2025-05-07T20:32:14.3603552Z W0507 20:32:14.356000 276022 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/7] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/torch/_higher_order_ops/triton_kernel_wrap.py", line 356, in generate_ttir 2025-05-07T20:32:14.3604920Z W0507 20:32:14.356000 276022 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/7] ttir_module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:14.3606210Z W0507 20:32:14.356000 276022 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/7] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py", line 100, in make_ir 2025-05-07T20:32:14.3607578Z W0507 20:32:14.356000 276022 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/7] return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:14.3608613Z W0507 20:32:14.356000 276022 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/7] module_map=module_map) 2025-05-07T20:32:14.3609850Z W0507 20:32:14.356000 276022 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/7] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/code_generator.py", line 1298, in ast_to_ttir 2025-05-07T20:32:14.3611094Z W0507 20:32:14.356000 276022 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/7] generator.visit(fn.parse()) 2025-05-07T20:32:14.3611932Z W0507 20:32:14.356000 276022 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/7] ~~~~~~~~~~~~~~~^^^^^^^^^^^^ 2025-05-07T20:32:14.3613442Z W0507 20:32:14.356000 276022 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/7] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/code_generator.py", line 1201, in visit 2025-05-07T20:32:14.3614755Z W0507 20:32:14.356000 276022 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/7] ret = super().visit(node) 2025-05-07T20:32:14.3615772Z W0507 20:32:14.356000 276022 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/7] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.13/ast.py", line 428, in visit 2025-05-07T20:32:14.3616935Z W0507 20:32:14.356000 276022 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/7] return 
visitor(node) 2025-05-07T20:32:14.3618137Z W0507 20:32:14.356000 276022 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/7] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/code_generator.py", line 352, in visit_Module 2025-05-07T20:32:14.3619404Z W0507 20:32:14.356000 276022 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/7] ast.NodeVisitor.generic_visit(self, node) 2025-05-07T20:32:14.3620293Z W0507 20:32:14.356000 276022 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/7] ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~^^^^^^^^^^^^ 2025-05-07T20:32:14.3621363Z W0507 20:32:14.356000 276022 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/7] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.13/ast.py", line 436, in generic_visit 2025-05-07T20:32:14.3622398Z W0507 20:32:14.356000 276022 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/7] self.visit(item) 2025-05-07T20:32:14.3623160Z W0507 20:32:14.356000 276022 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/7] ~~~~~~~~~~^^^^^^ 2025-05-07T20:32:14.3624316Z W0507 20:32:14.356000 276022 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/7] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/code_generator.py", line 1207, in visit 2025-05-07T20:32:14.3625658Z W0507 20:32:14.356000 276022 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/7] raise CompilationError(self.jit_fn.src, self.cur_node, repr(e)) from None 2025-05-07T20:32:14.3626710Z W0507 20:32:14.356000 276022 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/7] triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:14.3627613Z W0507 20:32:14.356000 276022 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/7] def _fbgemm_silu_mul_quant( 2025-05-07T20:32:14.3628350Z W0507 20:32:14.356000 276022 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/7] ^ 2025-05-07T20:32:14.3629363Z W0507 20:32:14.356000 276022 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/7] ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:14.6673670Z 2025-05-07T20:32:14.6674344Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:14.6675025Z self=, 2025-05-07T20:32:14.6675566Z T=4096, 2025-05-07T20:32:14.6675825Z D=5120, 2025-05-07T20:32:14.6676087Z scale_ub=None, 2025-05-07T20:32:14.6676378Z contiguous=True, 2025-05-07T20:32:14.6676605Z compiled=True, 2025-05-07T20:32:14.6676814Z ) 2025-05-07T20:32:14.6677132Z self = 2025-05-07T20:32:14.6677622Z T = 4096, D = 5120, scale_ub = None, contiguous = True, compiled = True 2025-05-07T20:32:14.6677912Z 2025-05-07T20:32:14.6678003Z @given( 2025-05-07T20:32:14.6678229Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:14.6678545Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:14.6678853Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:14.6679559Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:14.6679878Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:14.6680165Z ) 2025-05-07T20:32:14.6680510Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:14.6680946Z def test_silu_mul_quant( 2025-05-07T20:32:14.6681190Z self, 2025-05-07T20:32:14.6687692Z T: int, 2025-05-07T20:32:14.6687959Z D: int, 2025-05-07T20:32:14.6688180Z scale_ub: Optional[float], 2025-05-07T20:32:14.6688456Z contiguous: bool, 2025-05-07T20:32:14.6688687Z compiled: bool, 2025-05-07T20:32:14.6688920Z ) -> None: 2025-05-07T20:32:14.6689345Z torch.manual_seed(2025) 2025-05-07T20:32:14.6689595Z 2025-05-07T20:32:14.6689868Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:14.6690217Z 2025-05-07T20:32:14.6690402Z x_sign = torch.sign(x) 2025-05-07T20:32:14.6690700Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:14.6691022Z x = x_sign * x_clamp 2025-05-07T20:32:14.6691254Z x0 = x[:, :D] 2025-05-07T20:32:14.6691471Z x1 = x[:, D:] 2025-05-07T20:32:14.6691679Z 2025-05-07T20:32:14.6691862Z if contiguous: 2025-05-07T20:32:14.6692084Z x0 = x0.contiguous() 2025-05-07T20:32:14.6692341Z x1 = x1.contiguous() 2025-05-07T20:32:14.6692583Z 2025-05-07T20:32:14.6692770Z if scale_ub is not None: 2025-05-07T20:32:14.6693042Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:14.6693381Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:14.6693835Z ) 2025-05-07T20:32:14.6694042Z else: 2025-05-07T20:32:14.6694269Z scale_ub_tensor = None 2025-05-07T20:32:14.6694526Z 2025-05-07T20:32:14.6694770Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:14.6695095Z op = silu_mul_quant 2025-05-07T20:32:14.6695336Z if compiled: 2025-05-07T20:32:14.6695591Z op = torch.compile(op) 2025-05-07T20:32:14.6695886Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:14.6696155Z 2025-05-07T20:32:14.6696352Z y_fp8, y_scale = fn() 2025-05-07T20:32:14.6696635Z y = y_fp8.to(torch.float32) * y_scale[:, None] 2025-05-07T20:32:14.6696929Z 2025-05-07T20:32:14.6697158Z def ref_fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:14.6697491Z x0_fp32 = x0.to(torch.float32) 2025-05-07T20:32:14.6697784Z x1_fp32 = x1.to(torch.float32) 2025-05-07T20:32:14.6698088Z y = x0_fp32 * torch.sigmoid(x0_fp32) * x1_fp32 2025-05-07T20:32:14.6698807Z return triton_quantize_fp8_row(y, scale_ub_tensor) 2025-05-07T20:32:14.6699118Z 2025-05-07T20:32:14.6699311Z > y_fp8_ref, y_scale_ref = ref_fn() 2025-05-07T20:32:14.6699511Z 2025-05-07T20:32:14.6699612Z moe/activation_test.py:126: 2025-05-07T20:32:14.6699908Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:14.6700236Z moe/activation_test.py:124: in ref_fn 2025-05-07T20:32:14.6700558Z return triton_quantize_fp8_row(y, scale_ub_tensor) 2025-05-07T20:32:14.6702834Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/fbgemm_gpu/experimental/gemm/triton_gemm/fp8_gemm.py:2370: in triton_quantize_fp8_row 2025-05-07T20:32:14.6703573Z _kernel_quantize_fp8_row[grid]( 2025-05-07T20:32:14.6704110Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:14.6704789Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:14.6705462Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/autotuner.py:186: in run 2025-05-07T20:32:14.6706174Z timings = {config: self._bench(*args, config=config, **kwargs) for config in pruned_configs} 2025-05-07T20:32:14.6707075Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/autotuner.py:166: in _bench 2025-05-07T20:32:14.6707701Z return self.do_bench(kernel_call, quantiles=(0.5, 0.2, 0.8)) 2025-05-07T20:32:14.6708287Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/testing.py:117: in do_bench 2025-05-07T20:32:14.6708801Z fn() 2025-05-07T20:32:14.6709299Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/autotuner.py:152: in kernel_call 2025-05-07T20:32:14.6709866Z self.fn.run( 2025-05-07T20:32:14.6710447Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:14.6710969Z kernel = self.compile( 2025-05-07T20:32:14.6711500Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:14.6712133Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:14.6712532Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:14.6712754Z 2025-05-07T20:32:14.6712964Z self = 2025-05-07T20:32:14.6714027Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=2, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:14.6715383Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7fce5ae4b1a0>} 2025-05-07T20:32:14.6716708Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:14.6717716Z context = 2025-05-07T20:32:14.6717996Z 2025-05-07T20:32:14.6718163Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:14.6718668Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:14.6719132Z module_map=module_map) 2025-05-07T20:32:14.6719491Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:14.6719840Z E def _kernel_quantize_fp8_row( 2025-05-07T20:32:14.6720100Z E ^ 2025-05-07T20:32:14.6720556Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:14.6721000Z 2025-05-07T20:32:14.6721413Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:14.6721912Z 2025-05-07T20:32:14.6722014Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:14.6722421Z self=, 2025-05-07T20:32:14.6722816Z T=16384, 2025-05-07T20:32:14.6723009Z D=5120, 2025-05-07T20:32:14.6723196Z scale_ub=None, 2025-05-07T20:32:14.6723409Z contiguous=True, 2025-05-07T20:32:14.6723630Z compiled=True, 2025-05-07T20:32:14.6723828Z ) 2025-05-07T20:32:14.6724142Z self = 2025-05-07T20:32:14.6724627Z T = 16384, D = 5120, scale_ub = None, contiguous = True, compiled = True 2025-05-07T20:32:14.6724895Z 2025-05-07T20:32:14.6724972Z @given( 2025-05-07T20:32:14.6725202Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:14.6725514Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:14.6725813Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:14.6726139Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:14.6726555Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:14.6726840Z ) 2025-05-07T20:32:14.6727179Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:14.6727618Z def test_silu_mul_quant( 2025-05-07T20:32:14.6727858Z self, 2025-05-07T20:32:14.6728046Z T: int, 2025-05-07T20:32:14.6728244Z D: int, 2025-05-07T20:32:14.6728460Z scale_ub: Optional[float], 2025-05-07T20:32:14.6728724Z contiguous: bool, 2025-05-07T20:32:14.6728962Z compiled: bool, 2025-05-07T20:32:14.6729209Z ) -> None: 2025-05-07T20:32:14.6729442Z torch.manual_seed(2025) 2025-05-07T20:32:14.6729688Z 2025-05-07T20:32:14.6730044Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:14.6730382Z 2025-05-07T20:32:14.6730578Z x_sign = torch.sign(x) 2025-05-07T20:32:14.6730861Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:14.6731161Z x = x_sign * x_clamp 2025-05-07T20:32:14.6731395Z x0 = x[:, :D] 2025-05-07T20:32:14.6731606Z x1 = x[:, D:] 2025-05-07T20:32:14.6731807Z 2025-05-07T20:32:14.6731979Z if contiguous: 2025-05-07T20:32:14.6732206Z x0 = x0.contiguous() 2025-05-07T20:32:14.6732455Z x1 = x1.contiguous() 2025-05-07T20:32:14.6732682Z 2025-05-07T20:32:14.6732869Z if scale_ub is not None: 2025-05-07T20:32:14.6733139Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:14.6733459Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:14.6733882Z ) 2025-05-07T20:32:14.6734070Z else: 2025-05-07T20:32:14.6734268Z scale_ub_tensor = None 2025-05-07T20:32:14.6734518Z 2025-05-07T20:32:14.6734749Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:14.6735051Z op = silu_mul_quant 2025-05-07T20:32:14.6735297Z if compiled: 2025-05-07T20:32:14.6735537Z op = torch.compile(op) 2025-05-07T20:32:14.6735827Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:14.6736097Z 2025-05-07T20:32:14.6736284Z y_fp8, y_scale = fn() 2025-05-07T20:32:14.6736560Z y = y_fp8.to(torch.float32) * y_scale[:, None] 2025-05-07T20:32:14.6736837Z 2025-05-07T20:32:14.6737093Z def ref_fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:14.6737418Z x0_fp32 = x0.to(torch.float32) 2025-05-07T20:32:14.6737698Z x1_fp32 = x1.to(torch.float32) 2025-05-07T20:32:14.6738003Z y = x0_fp32 * torch.sigmoid(x0_fp32) * x1_fp32 2025-05-07T20:32:14.6738353Z return triton_quantize_fp8_row(y, scale_ub_tensor) 2025-05-07T20:32:14.6738660Z 2025-05-07T20:32:14.6738856Z > 
y_fp8_ref, y_scale_ref = ref_fn() 2025-05-07T20:32:14.6739052Z 2025-05-07T20:32:14.6739148Z moe/activation_test.py:126: 2025-05-07T20:32:14.6739440Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:14.6739768Z moe/activation_test.py:124: in ref_fn 2025-05-07T20:32:14.6740090Z return triton_quantize_fp8_row(y, scale_ub_tensor) 2025-05-07T20:32:14.6740862Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/fbgemm_gpu/experimental/gemm/triton_gemm/fp8_gemm.py:2370: in triton_quantize_fp8_row 2025-05-07T20:32:14.6741594Z _kernel_quantize_fp8_row[grid]( 2025-05-07T20:32:14.6742127Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:14.6742794Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:14.6743478Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/autotuner.py:186: in run 2025-05-07T20:32:14.6744181Z timings = {config: self._bench(*args, config=config, **kwargs) for config in pruned_configs} 2025-05-07T20:32:14.6744902Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/autotuner.py:166: in _bench 2025-05-07T20:32:14.6745623Z return self.do_bench(kernel_call, quantiles=(0.5, 0.2, 0.8)) 2025-05-07T20:32:14.6746214Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/testing.py:117: in do_bench 2025-05-07T20:32:14.6746720Z fn() 2025-05-07T20:32:14.6747222Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/autotuner.py:152: in kernel_call 2025-05-07T20:32:14.6747796Z self.fn.run( 2025-05-07T20:32:14.6748254Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:14.6748766Z kernel = self.compile( 2025-05-07T20:32:14.6749405Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:14.6750048Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:14.6750433Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:14.6750669Z 2025-05-07T20:32:14.6750872Z self = 2025-05-07T20:32:14.6751940Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=2, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:14.6753287Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7fcda5bce8e0>} 2025-05-07T20:32:14.6754611Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:14.6755756Z context = 2025-05-07T20:32:14.6756045Z 2025-05-07T20:32:14.6756214Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:14.6756726Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:14.6757187Z module_map=module_map) 2025-05-07T20:32:14.6757542Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:14.6757895Z E def _kernel_quantize_fp8_row( 2025-05-07T20:32:14.6758157Z E ^ 2025-05-07T20:32:14.6758604Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:14.6759048Z 2025-05-07T20:32:14.6759458Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:14.6959153Z W0507 20:32:14.694000 276022 site-packages/torch/_dynamo/convert_frame.py:987] [0/8] torch._dynamo hit config.recompile_limit (8) 2025-05-07T20:32:14.6960440Z W0507 20:32:14.694000 276022 site-packages/torch/_dynamo/convert_frame.py:987] [0/8] function: 'silu_mul_quant' (/home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:55) 2025-05-07T20:32:14.6961766Z W0507 20:32:14.694000 276022 site-packages/torch/_dynamo/convert_frame.py:987] [0/8] last reason: 0/7: tensor 'x0' stride mismatch at index 0. expected 5120, actual 10240 2025-05-07T20:32:14.6962732Z W0507 20:32:14.694000 276022 site-packages/torch/_dynamo/convert_frame.py:987] [0/8] To log all recompilation reasons, use TORCH_LOGS="recompiles". 2025-05-07T20:32:14.6963826Z W0507 20:32:14.694000 276022 site-packages/torch/_dynamo/convert_frame.py:987] [0/8] To diagnose recompilation issues, see https://pytorch.org/docs/main/torch.compiler_troubleshooting.html. 2025-05-07T20:32:15.1199085Z 2025-05-07T20:32:15.1199703Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:15.1200319Z self=, 2025-05-07T20:32:15.1201294Z T=1, 2025-05-07T20:32:15.1201493Z D=5120, 2025-05-07T20:32:15.1201687Z scale_ub=1200.0, 2025-05-07T20:32:15.1201914Z contiguous=True, 2025-05-07T20:32:15.1202136Z compiled=True, 2025-05-07T20:32:15.1202341Z ) 2025-05-07T20:32:15.1202664Z self = 2025-05-07T20:32:15.1203155Z T = 1, D = 5120, scale_ub = 1200.0, contiguous = True, compiled = True 2025-05-07T20:32:15.1203410Z 2025-05-07T20:32:15.1203488Z @given( 2025-05-07T20:32:15.1203720Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:15.1204036Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:15.1204336Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:15.1204829Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:15.1205162Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:15.1205452Z ) 2025-05-07T20:32:15.1205792Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:15.1206238Z def test_silu_mul_quant( 2025-05-07T20:32:15.1206485Z self, 2025-05-07T20:32:15.1206677Z T: int, 2025-05-07T20:32:15.1206873Z D: int, 2025-05-07T20:32:15.1207088Z scale_ub: Optional[float], 2025-05-07T20:32:15.1207352Z contiguous: bool, 2025-05-07T20:32:15.1207591Z compiled: bool, 2025-05-07T20:32:15.1207818Z ) -> None: 2025-05-07T20:32:15.1208026Z torch.manual_seed(2025) 2025-05-07T20:32:15.1208265Z 2025-05-07T20:32:15.1208538Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:15.1208870Z 2025-05-07T20:32:15.1209065Z x_sign = torch.sign(x) 2025-05-07T20:32:15.1209402Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:15.1209720Z x = x_sign * x_clamp 2025-05-07T20:32:15.1209956Z x0 = x[:, :D] 2025-05-07T20:32:15.1210204Z x1 = x[:, D:] 2025-05-07T20:32:15.1210416Z 2025-05-07T20:32:15.1210613Z if contiguous: 2025-05-07T20:32:15.1210843Z x0 = x0.contiguous() 2025-05-07T20:32:15.1211105Z x1 = x1.contiguous() 2025-05-07T20:32:15.1211347Z 2025-05-07T20:32:15.1211540Z if scale_ub is not None: 2025-05-07T20:32:15.1211815Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:15.1212147Z [scale_ub], device="cuda", dtype=torch.float32 
2025-05-07T20:32:15.1212461Z ) 2025-05-07T20:32:15.1212653Z else: 2025-05-07T20:32:15.1212867Z scale_ub_tensor = None 2025-05-07T20:32:15.1213122Z 2025-05-07T20:32:15.1213349Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:15.1213820Z op = silu_mul_quant 2025-05-07T20:32:15.1214077Z if compiled: 2025-05-07T20:32:15.1214325Z op = torch.compile(op) 2025-05-07T20:32:15.1214623Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:15.1214898Z 2025-05-07T20:32:15.1215094Z > y_fp8, y_scale = fn() 2025-05-07T20:32:15.1215265Z 2025-05-07T20:32:15.1215366Z moe/activation_test.py:117: 2025-05-07T20:32:15.1215661Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:15.1215994Z moe/activation_test.py:115: in fn 2025-05-07T20:32:15.1216269Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:15.1216827Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/torch/_dynamo/eval_frame.py:678: in _fn 2025-05-07T20:32:15.1217394Z return fn(*args, **kwargs) 2025-05-07T20:32:15.1218037Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:32:15.1218722Z _fbgemm_silu_mul_quant[grid]( 2025-05-07T20:32:15.1219255Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:15.1219929Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:15.1220668Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:15.1221201Z kernel = self.compile( 2025-05-07T20:32:15.1221739Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:15.1222381Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:15.1222777Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:15.1223011Z 2025-05-07T20:32:15.1223216Z self = 2025-05-07T20:32:15.1224351Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:15.1225715Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7fce5a9f4400>} 2025-05-07T20:32:15.1227038Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:15.1228046Z context = 2025-05-07T20:32:15.1228330Z 2025-05-07T20:32:15.1228499Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:15.1229012Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:15.1229475Z module_map=module_map) 2025-05-07T20:32:15.1229839Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:15.1230192Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:32:15.1230445Z E ^ 2025-05-07T20:32:15.1230907Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:15.1231347Z 2025-05-07T20:32:15.1231759Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:15.1232261Z 2025-05-07T20:32:15.1232370Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:15.1232771Z self=, 2025-05-07T20:32:15.1233169Z T=1, 2025-05-07T20:32:15.1233351Z D=5120, 2025-05-07T20:32:15.1233540Z scale_ub=None, 2025-05-07T20:32:15.1233758Z contiguous=False, 2025-05-07T20:32:15.1233983Z compiled=True, 2025-05-07T20:32:15.1234182Z ) 2025-05-07T20:32:15.1234502Z self = 2025-05-07T20:32:15.1234979Z T = 1, D = 5120, scale_ub = None, contiguous = False, compiled = True 2025-05-07T20:32:15.1235232Z 2025-05-07T20:32:15.1235319Z @given( 2025-05-07T20:32:15.1235544Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:15.1235864Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:15.1236168Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:15.1236491Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:15.1236819Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:15.1237107Z ) 2025-05-07T20:32:15.1237449Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:15.1237889Z def test_silu_mul_quant( 2025-05-07T20:32:15.1238132Z self, 2025-05-07T20:32:15.1238323Z T: int, 2025-05-07T20:32:15.1238524Z D: int, 2025-05-07T20:32:15.1238750Z scale_ub: Optional[float], 2025-05-07T20:32:15.1239023Z contiguous: bool, 2025-05-07T20:32:15.1239259Z compiled: bool, 2025-05-07T20:32:15.1239483Z ) -> None: 2025-05-07T20:32:15.1239701Z torch.manual_seed(2025) 2025-05-07T20:32:15.1240034Z 2025-05-07T20:32:15.1240306Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:15.1240649Z 2025-05-07T20:32:15.1240838Z x_sign = torch.sign(x) 2025-05-07T20:32:15.1241127Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:15.1241438Z x = x_sign * x_clamp 2025-05-07T20:32:15.1241676Z x0 = x[:, :D] 2025-05-07T20:32:15.1241892Z x1 = x[:, D:] 2025-05-07T20:32:15.1242101Z 2025-05-07T20:32:15.1242282Z if contiguous: 2025-05-07T20:32:15.1242517Z x0 = x0.contiguous() 2025-05-07T20:32:15.1242773Z x1 = x1.contiguous() 2025-05-07T20:32:15.1243006Z 2025-05-07T20:32:15.1243280Z if scale_ub is not None: 2025-05-07T20:32:15.1243560Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:15.1243887Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:15.1244196Z ) 2025-05-07T20:32:15.1244394Z else: 2025-05-07T20:32:15.1244609Z scale_ub_tensor = None 2025-05-07T20:32:15.1244857Z 2025-05-07T20:32:15.1245088Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:15.1245402Z op = silu_mul_quant 2025-05-07T20:32:15.1245645Z if compiled: 2025-05-07T20:32:15.1245890Z op = torch.compile(op) 2025-05-07T20:32:15.1246185Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:15.1246456Z 2025-05-07T20:32:15.1246651Z y_fp8, y_scale = fn() 2025-05-07T20:32:15.1246933Z y = y_fp8.to(torch.float32) * y_scale[:, None] 2025-05-07T20:32:15.1247221Z 2025-05-07T20:32:15.1247459Z def ref_fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:15.1247803Z x0_fp32 = x0.to(torch.float32) 2025-05-07T20:32:15.1248091Z x1_fp32 = x1.to(torch.float32) 2025-05-07T20:32:15.1248402Z y = x0_fp32 * torch.sigmoid(x0_fp32) * x1_fp32 2025-05-07T20:32:15.1248765Z return triton_quantize_fp8_row(y, scale_ub_tensor) 2025-05-07T20:32:15.1249078Z 2025-05-07T20:32:15.1249277Z > y_fp8_ref, 
y_scale_ref = ref_fn() 2025-05-07T20:32:15.1249476Z 2025-05-07T20:32:15.1249575Z moe/activation_test.py:126: 2025-05-07T20:32:15.1249871Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:15.1250201Z moe/activation_test.py:124: in ref_fn 2025-05-07T20:32:15.1250527Z return triton_quantize_fp8_row(y, scale_ub_tensor) 2025-05-07T20:32:15.1251301Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/fbgemm_gpu/experimental/gemm/triton_gemm/fp8_gemm.py:2370: in triton_quantize_fp8_row 2025-05-07T20:32:15.1252044Z _kernel_quantize_fp8_row[grid]( 2025-05-07T20:32:15.1252585Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:15.1253257Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:15.1254039Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/autotuner.py:186: in run 2025-05-07T20:32:15.1254747Z timings = {config: self._bench(*args, config=config, **kwargs) for config in pruned_configs} 2025-05-07T20:32:15.1255465Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/autotuner.py:166: in _bench 2025-05-07T20:32:15.1256097Z return self.do_bench(kernel_call, quantiles=(0.5, 0.2, 0.8)) 2025-05-07T20:32:15.1256697Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/testing.py:117: in do_bench 2025-05-07T20:32:15.1257209Z fn() 2025-05-07T20:32:15.1257720Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/autotuner.py:152: in kernel_call 2025-05-07T20:32:15.1258294Z self.fn.run( 2025-05-07T20:32:15.1258755Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:15.1259402Z kernel = self.compile( 2025-05-07T20:32:15.1259942Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:15.1260587Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:15.1260976Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:15.1261208Z 2025-05-07T20:32:15.1261414Z self = 2025-05-07T20:32:15.1262550Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=2, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:15.1263899Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7fce5a9ee020>} 2025-05-07T20:32:15.1265226Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:15.1266227Z context = 2025-05-07T20:32:15.1266521Z 2025-05-07T20:32:15.1266684Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:15.1267198Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:15.1267654Z module_map=module_map) 2025-05-07T20:32:15.1268016Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:15.1268373Z E def _kernel_quantize_fp8_row( 2025-05-07T20:32:15.1268638Z E ^ 2025-05-07T20:32:15.1269090Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:15.1269591Z 2025-05-07T20:32:15.1269998Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:15.2704010Z 2025-05-07T20:32:15.2704425Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:15.2705059Z self=, 2025-05-07T20:32:15.2705587Z T=1, 2025-05-07T20:32:15.2705783Z D=5120, 2025-05-07T20:32:15.2705988Z scale_ub=None, 2025-05-07T20:32:15.2706208Z contiguous=True, 2025-05-07T20:32:15.2706435Z compiled=False, 2025-05-07T20:32:15.2706656Z ) 2025-05-07T20:32:15.2706984Z self = 2025-05-07T20:32:15.2707699Z T = 1, D = 5120, scale_ub = None, contiguous = True, compiled = False 2025-05-07T20:32:15.2714728Z 2025-05-07T20:32:15.2714826Z @given( 2025-05-07T20:32:15.2715080Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:15.2715403Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:15.2715720Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:15.2716052Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:15.2716382Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:15.2716664Z ) 2025-05-07T20:32:15.2717016Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:15.2717463Z def test_silu_mul_quant( 2025-05-07T20:32:15.2717701Z self, 2025-05-07T20:32:15.2717901Z T: int, 2025-05-07T20:32:15.2718098Z D: int, 2025-05-07T20:32:15.2718309Z scale_ub: Optional[float], 2025-05-07T20:32:15.2718579Z contiguous: bool, 2025-05-07T20:32:15.2718818Z compiled: bool, 2025-05-07T20:32:15.2719084Z ) -> None: 2025-05-07T20:32:15.2719308Z torch.manual_seed(2025) 2025-05-07T20:32:15.2719581Z 2025-05-07T20:32:15.2719858Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:15.2720570Z 2025-05-07T20:32:15.2720755Z x_sign = torch.sign(x) 2025-05-07T20:32:15.2721042Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:15.2721353Z x = x_sign * x_clamp 2025-05-07T20:32:15.2721589Z x0 = x[:, :D] 2025-05-07T20:32:15.2721796Z x1 = x[:, D:] 2025-05-07T20:32:15.2722006Z 2025-05-07T20:32:15.2722186Z if contiguous: 2025-05-07T20:32:15.2722407Z x0 = x0.contiguous() 2025-05-07T20:32:15.2722662Z x1 = x1.contiguous() 2025-05-07T20:32:15.2722902Z 2025-05-07T20:32:15.2723090Z if scale_ub is not None: 2025-05-07T20:32:15.2723362Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:15.2723845Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:15.2724154Z ) 2025-05-07T20:32:15.2724350Z else: 2025-05-07T20:32:15.2724561Z scale_ub_tensor = None 2025-05-07T20:32:15.2724802Z 2025-05-07T20:32:15.2725031Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:15.2725353Z op = silu_mul_quant 2025-05-07T20:32:15.2725594Z if compiled: 2025-05-07T20:32:15.2725840Z op = torch.compile(op) 2025-05-07T20:32:15.2726131Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:15.2726402Z 2025-05-07T20:32:15.2726582Z > y_fp8, y_scale = fn() 2025-05-07T20:32:15.2726747Z 2025-05-07T20:32:15.2726839Z moe/activation_test.py:117: 2025-05-07T20:32:15.2727134Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:15.2727461Z moe/activation_test.py:115: in fn 2025-05-07T20:32:15.2727740Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:15.2728423Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:32:15.2729103Z 
_fbgemm_silu_mul_quant[grid]( 2025-05-07T20:32:15.2729634Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:15.2730309Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:15.2730962Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:15.2731489Z kernel = self.compile( 2025-05-07T20:32:15.2732023Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:15.2732665Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:15.2733061Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:15.2733287Z 2025-05-07T20:32:15.2733503Z self = 2025-05-07T20:32:15.2734723Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:15.2736076Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7fce60114400>} 2025-05-07T20:32:15.2737393Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:15.2738398Z context = 2025-05-07T20:32:15.2738676Z 2025-05-07T20:32:15.2738851Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:15.2739352Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:15.2739806Z module_map=module_map) 2025-05-07T20:32:15.2740290Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:15.2740638Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:32:15.2740889Z E ^ 2025-05-07T20:32:15.2741344Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')")
2025-05-07T20:32:15.2742188Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:100: CompilationError
2025-05-07T20:32:15.2742689Z
Hypothesis went on to try the remaining sampled examples. Every one failed with the identical CompilationError raised from triton/compiler/compiler.py:100: ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')"). The kernel that failed to compile was _fbgemm_silu_mul_quant (reached from silu_mul_quant at fbgemm_gpu/experimental/gen_ai/moe/activation.py:80, eagerly and under torch.compile alike), except where noted below, where fn() ran and the reference path failed instead while compiling _kernel_quantize_fp8_row (reached from triton_quantize_fp8_row at fbgemm_gpu/experimental/gemm/triton_gemm/fp8_gemm.py:2370):
2025-05-07T20:32:15.2742795Z Trying example: test_silu_mul_quant(T=128, D=5120, scale_ub=None, contiguous=False, compiled=True) -> _fbgemm_silu_mul_quant
2025-05-07T20:32:15.2774528Z Trying example: test_silu_mul_quant(T=128, D=7168, scale_ub=1200.0, contiguous=False, compiled=False) -> _fbgemm_silu_mul_quant
2025-05-07T20:32:15.4376161Z Trying example: test_silu_mul_quant(T=128, D=5120, scale_ub=None, contiguous=False, compiled=False) -> _fbgemm_silu_mul_quant
2025-05-07T20:32:15.4407301Z Trying example: test_silu_mul_quant(T=128, D=5120, scale_ub=1200.0, contiguous=True, compiled=False) -> _fbgemm_silu_mul_quant
2025-05-07T20:32:15.5994027Z Trying example: test_silu_mul_quant(T=1, D=7168, scale_ub=1200.0, contiguous=True, compiled=True) -> _fbgemm_silu_mul_quant
2025-05-07T20:32:15.6026902Z Trying example: test_silu_mul_quant(T=1, D=7168, scale_ub=1200.0, contiguous=False, compiled=True) -> _fbgemm_silu_mul_quant
2025-05-07T20:32:15.8125071Z Trying example: test_silu_mul_quant(T=1, D=7168, scale_ub=None, contiguous=False, compiled=True) -> fn() succeeded; _kernel_quantize_fp8_row failed in ref_fn()
2025-05-07T20:32:15.8167856Z Trying example: test_silu_mul_quant(T=1, D=5120, scale_ub=1200.0, contiguous=False, compiled=True) -> _fbgemm_silu_mul_quant
2025-05-07T20:32:15.9610355Z Trying example: test_silu_mul_quant(T=1, D=5120, scale_ub=1200.0, contiguous=False, compiled=False) -> _fbgemm_silu_mul_quant
2025-05-07T20:32:15.9642759Z Trying example: test_silu_mul_quant(T=16384, D=5120, scale_ub=1200.0, contiguous=False, compiled=True) -> _fbgemm_silu_mul_quant
2025-05-07T20:32:15.9675326Z Trying example: test_silu_mul_quant(T=2048, D=7168, scale_ub=1200.0, contiguous=False, compiled=True)
2025-05-07T20:32:16.1569366Z > y_fp8, y_scale = fn()
2025-05-07T20:32:16.1569630Z moe/activation_test.py:117:
2025-05-07T20:32:16.1570251Z moe/activation_test.py:115: in fn
2025-05-07T20:32:16.1570535Z     return op(x0, x1, scale_ub_tensor)
2025-05-07T20:32:16.1571088Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/torch/_dynamo/eval_frame.py:678: in _fn
2025-05-07T20:32:16.1571644Z     return fn(*args, **kwargs)
2025-05-07T20:32:16.1572291Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:32:16.1572968Z _fbgemm_silu_mul_quant[grid]( 2025-05-07T20:32:16.1573500Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:16.1574317Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:16.1574970Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:16.1575499Z kernel = self.compile( 2025-05-07T20:32:16.1576040Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:16.1576686Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:16.1577082Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:16.1577309Z 2025-05-07T20:32:16.1577520Z self = 2025-05-07T20:32:16.1578582Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:16.1580008Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7fcda5de58a0>} 2025-05-07T20:32:16.1581349Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:16.1582360Z context = 2025-05-07T20:32:16.1582690Z 2025-05-07T20:32:16.1582859Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:16.1583365Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:16.1583827Z module_map=module_map) 2025-05-07T20:32:16.1584189Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:16.1584543Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:32:16.1584797Z E ^ 2025-05-07T20:32:16.1585254Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:16.1585695Z 2025-05-07T20:32:16.1586192Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:16.1586702Z 2025-05-07T20:32:16.1586804Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:16.1587220Z self=, 2025-05-07T20:32:16.1587618Z T=1, 2025-05-07T20:32:16.1587801Z D=5120, 2025-05-07T20:32:16.1587993Z scale_ub=None, 2025-05-07T20:32:16.1588208Z contiguous=False, 2025-05-07T20:32:16.1588468Z compiled=False, 2025-05-07T20:32:16.1588674Z ) 2025-05-07T20:32:16.1588985Z self = 2025-05-07T20:32:16.1589472Z T = 1, D = 5120, scale_ub = None, contiguous = False, compiled = False 2025-05-07T20:32:16.1589737Z 2025-05-07T20:32:16.1589818Z @given( 2025-05-07T20:32:16.1590052Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:16.1590361Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:16.1590671Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:16.1591001Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:16.1591318Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:16.1591605Z ) 2025-05-07T20:32:16.1591952Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:16.1592389Z def test_silu_mul_quant( 2025-05-07T20:32:16.1592627Z self, 2025-05-07T20:32:16.1592821Z T: int, 2025-05-07T20:32:16.1593014Z D: int, 2025-05-07T20:32:16.1593225Z scale_ub: Optional[float], 2025-05-07T20:32:16.1593493Z contiguous: bool, 2025-05-07T20:32:16.1593736Z compiled: bool, 2025-05-07T20:32:16.1593949Z ) -> None: 2025-05-07T20:32:16.1594166Z torch.manual_seed(2025) 2025-05-07T20:32:16.1594406Z 2025-05-07T20:32:16.1594669Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:16.1595014Z 2025-05-07T20:32:16.1595219Z x_sign = torch.sign(x) 2025-05-07T20:32:16.1595504Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:16.1595811Z x = x_sign * x_clamp 2025-05-07T20:32:16.1596051Z x0 = x[:, :D] 2025-05-07T20:32:16.1596262Z x1 = x[:, D:] 2025-05-07T20:32:16.1596468Z 2025-05-07T20:32:16.1596653Z if contiguous: 2025-05-07T20:32:16.1596875Z x0 = x0.contiguous() 2025-05-07T20:32:16.1597134Z x1 = x1.contiguous() 2025-05-07T20:32:16.1597378Z 2025-05-07T20:32:16.1597567Z if scale_ub is not None: 2025-05-07T20:32:16.1597838Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:16.1598418Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:16.1598880Z ) 2025-05-07T20:32:16.1599071Z else: 2025-05-07T20:32:16.1599285Z scale_ub_tensor = None 2025-05-07T20:32:16.1599572Z 2025-05-07T20:32:16.1599822Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:16.1600143Z op = silu_mul_quant 2025-05-07T20:32:16.1600394Z if compiled: 2025-05-07T20:32:16.1600640Z op = torch.compile(op) 2025-05-07T20:32:16.1600941Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:16.1601298Z 2025-05-07T20:32:16.1601485Z > y_fp8, y_scale = fn() 2025-05-07T20:32:16.1601655Z 2025-05-07T20:32:16.1601752Z moe/activation_test.py:117: 2025-05-07T20:32:16.1602051Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:16.1602382Z moe/activation_test.py:115: in fn 2025-05-07T20:32:16.1602663Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:16.1603346Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:32:16.1604027Z 
_fbgemm_silu_mul_quant[grid]( 2025-05-07T20:32:16.1604676Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:16.1605356Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:16.1606018Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:16.1606552Z kernel = self.compile( 2025-05-07T20:32:16.1607083Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:16.1607734Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:16.1608128Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:16.1608354Z 2025-05-07T20:32:16.1608559Z self = 2025-05-07T20:32:16.1609675Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:16.1611035Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7fcda5de53a0>} 2025-05-07T20:32:16.1612366Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:16.1613379Z context = 2025-05-07T20:32:16.1613726Z 2025-05-07T20:32:16.1613890Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:16.1614409Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:16.1614874Z module_map=module_map) 2025-05-07T20:32:16.1615244Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:16.1615594Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:32:16.1615855Z E ^ 2025-05-07T20:32:16.1616314Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:16.1616757Z 2025-05-07T20:32:16.1617166Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:16.1617680Z 2025-05-07T20:32:16.1617781Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:16.1618193Z self=, 2025-05-07T20:32:16.1618593Z T=4096, 2025-05-07T20:32:16.1618837Z D=7168, 2025-05-07T20:32:16.1619034Z scale_ub=1200.0, 2025-05-07T20:32:16.1619256Z contiguous=False, 2025-05-07T20:32:16.1619472Z compiled=False, 2025-05-07T20:32:16.1619679Z ) 2025-05-07T20:32:16.1619996Z self = 2025-05-07T20:32:16.1620485Z T = 4096, D = 7168, scale_ub = 1200.0, contiguous = False, compiled = False 2025-05-07T20:32:16.1620762Z 2025-05-07T20:32:16.1620838Z @given( 2025-05-07T20:32:16.1630273Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:16.1630702Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:16.1631013Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:16.1631347Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:16.1631678Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:16.1631957Z ) 2025-05-07T20:32:16.1632309Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:16.1632754Z def test_silu_mul_quant( 2025-05-07T20:32:16.1632986Z self, 2025-05-07T20:32:16.1633178Z T: int, 2025-05-07T20:32:16.1633374Z D: int, 2025-05-07T20:32:16.1633582Z scale_ub: Optional[float], 2025-05-07T20:32:16.1633857Z contiguous: bool, 2025-05-07T20:32:16.1634185Z compiled: bool, 2025-05-07T20:32:16.1634400Z ) -> None: 2025-05-07T20:32:16.1634600Z torch.manual_seed(2025) 2025-05-07T20:32:16.1634841Z 2025-05-07T20:32:16.1635104Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:16.1635455Z 2025-05-07T20:32:16.1635648Z x_sign = torch.sign(x) 2025-05-07T20:32:16.1635933Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:16.1636243Z x = x_sign * x_clamp 2025-05-07T20:32:16.1636481Z x0 = x[:, :D] 2025-05-07T20:32:16.1636696Z x1 = x[:, D:] 2025-05-07T20:32:16.1636895Z 2025-05-07T20:32:16.1637084Z if contiguous: 2025-05-07T20:32:16.1637309Z x0 = x0.contiguous() 2025-05-07T20:32:16.1637553Z x1 = x1.contiguous() 2025-05-07T20:32:16.1637785Z 2025-05-07T20:32:16.1637977Z if scale_ub is not None: 2025-05-07T20:32:16.1638239Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:16.1638575Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:16.1638881Z ) 2025-05-07T20:32:16.1639066Z else: 2025-05-07T20:32:16.1639272Z scale_ub_tensor = None 2025-05-07T20:32:16.1639529Z 2025-05-07T20:32:16.1639755Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:16.1640067Z op = silu_mul_quant 2025-05-07T20:32:16.1640307Z if compiled: 2025-05-07T20:32:16.1640543Z op = torch.compile(op) 2025-05-07T20:32:16.1640826Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:16.1641100Z 2025-05-07T20:32:16.1641293Z > y_fp8, y_scale = fn() 2025-05-07T20:32:16.1641455Z 2025-05-07T20:32:16.1641553Z moe/activation_test.py:117: 2025-05-07T20:32:16.1641846Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:16.1642171Z moe/activation_test.py:115: in fn 2025-05-07T20:32:16.1642456Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:16.1643150Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 
2025-05-07T20:32:16.1643829Z _fbgemm_silu_mul_quant[grid]( 2025-05-07T20:32:16.1644362Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:16.1645039Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:16.1645686Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:16.1646214Z kernel = self.compile( 2025-05-07T20:32:16.1646813Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:16.1647471Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:16.1647868Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:16.1648098Z 2025-05-07T20:32:16.1648307Z self = 2025-05-07T20:32:16.1649374Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:16.1650774Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7fcda5de63e0>} 2025-05-07T20:32:16.1652095Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:16.1653118Z context = 2025-05-07T20:32:16.1653410Z 2025-05-07T20:32:16.1653719Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:16.1654237Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:16.1654689Z module_map=module_map) 2025-05-07T20:32:16.1655057Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:16.1655408Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:32:16.1655665Z E ^ 2025-05-07T20:32:16.1656114Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:16.1656561Z 2025-05-07T20:32:16.1656970Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:16.3206629Z 2025-05-07T20:32:16.3207036Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:16.3207615Z self=, 2025-05-07T20:32:16.3208052Z T=16384, 2025-05-07T20:32:16.3208289Z D=7168, 2025-05-07T20:32:16.3208480Z scale_ub=None, 2025-05-07T20:32:16.3208705Z contiguous=True, 2025-05-07T20:32:16.3208931Z compiled=True, 2025-05-07T20:32:16.3209143Z ) 2025-05-07T20:32:16.3209468Z self = 2025-05-07T20:32:16.3210012Z T = 16384, D = 7168, scale_ub = None, contiguous = True, compiled = True 2025-05-07T20:32:16.3210279Z 2025-05-07T20:32:16.3210358Z @given( 2025-05-07T20:32:16.3210594Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:16.3210908Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:16.3211209Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:16.3211548Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:16.3211879Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:16.3212168Z ) 2025-05-07T20:32:16.3212514Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:16.3212954Z def test_silu_mul_quant( 2025-05-07T20:32:16.3213199Z self, 2025-05-07T20:32:16.3213391Z T: int, 2025-05-07T20:32:16.3213594Z D: int, 2025-05-07T20:32:16.3214021Z scale_ub: Optional[float], 2025-05-07T20:32:16.3214291Z contiguous: bool, 2025-05-07T20:32:16.3214533Z compiled: bool, 2025-05-07T20:32:16.3214763Z ) -> None: 2025-05-07T20:32:16.3214975Z torch.manual_seed(2025) 2025-05-07T20:32:16.3215216Z 2025-05-07T20:32:16.3215488Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:16.3215824Z 2025-05-07T20:32:16.3216030Z x_sign = torch.sign(x) 2025-05-07T20:32:16.3216622Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:16.3216928Z x = x_sign * x_clamp 2025-05-07T20:32:16.3217173Z x0 = x[:, :D] 2025-05-07T20:32:16.3217392Z x1 = x[:, D:] 2025-05-07T20:32:16.3217593Z 2025-05-07T20:32:16.3217795Z if contiguous: 2025-05-07T20:32:16.3218031Z x0 = x0.contiguous() 2025-05-07T20:32:16.3218286Z x1 = x1.contiguous() 2025-05-07T20:32:16.3218523Z 2025-05-07T20:32:16.3218715Z if scale_ub is not None: 2025-05-07T20:32:16.3219089Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:16.3219415Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:16.3219730Z ) 2025-05-07T20:32:16.3219925Z else: 2025-05-07T20:32:16.3220134Z scale_ub_tensor = None 2025-05-07T20:32:16.3220385Z 2025-05-07T20:32:16.3220619Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:16.3220931Z op = silu_mul_quant 2025-05-07T20:32:16.3221183Z if compiled: 2025-05-07T20:32:16.3221435Z op = torch.compile(op) 2025-05-07T20:32:16.3221727Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:16.3222002Z 2025-05-07T20:32:16.3222357Z > y_fp8, y_scale = fn() 2025-05-07T20:32:16.3222525Z 2025-05-07T20:32:16.3222630Z moe/activation_test.py:117: 2025-05-07T20:32:16.3222918Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:16.3223250Z moe/activation_test.py:115: in fn 2025-05-07T20:32:16.3223532Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:16.3224088Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/torch/_dynamo/eval_frame.py:678: in _fn 2025-05-07T20:32:16.3224648Z return fn(*args, **kwargs) 
2025-05-07T20:32:16.3225304Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:32:16.3225986Z _fbgemm_silu_mul_quant[grid]( 2025-05-07T20:32:16.3226514Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:16.3227195Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:16.3227860Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:16.3228385Z kernel = self.compile( 2025-05-07T20:32:16.3228926Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:16.3229603Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:16.3230030Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:16.3230256Z 2025-05-07T20:32:16.3230464Z self = 2025-05-07T20:32:16.3231537Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:16.3232913Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7fcda4ef2a20>} 2025-05-07T20:32:16.3234246Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:16.3235255Z context = 2025-05-07T20:32:16.3235548Z 2025-05-07T20:32:16.3235712Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:16.3236232Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:16.3236779Z module_map=module_map) 2025-05-07T20:32:16.3237138Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:16.3237490Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:32:16.3237756Z E ^ 2025-05-07T20:32:16.3238215Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:16.3238664Z 2025-05-07T20:32:16.3239075Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:16.3239675Z 2025-05-07T20:32:16.3239776Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:16.3240186Z self=, 2025-05-07T20:32:16.3240576Z T=4096, 2025-05-07T20:32:16.3240765Z D=5120, 2025-05-07T20:32:16.3240957Z scale_ub=None, 2025-05-07T20:32:16.3241168Z contiguous=False, 2025-05-07T20:32:16.3241395Z compiled=True, 2025-05-07T20:32:16.3241596Z ) 2025-05-07T20:32:16.3241903Z self = 2025-05-07T20:32:16.3242388Z T = 4096, D = 5120, scale_ub = None, contiguous = False, compiled = True 2025-05-07T20:32:16.3242660Z 2025-05-07T20:32:16.3242820Z @given( 2025-05-07T20:32:16.3243051Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:16.3243354Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:16.3243658Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:16.3243987Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:16.3244305Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:16.3244585Z ) 2025-05-07T20:32:16.3244930Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:16.3245365Z def test_silu_mul_quant( 2025-05-07T20:32:16.3245599Z self, 2025-05-07T20:32:16.3245797Z T: int, 2025-05-07T20:32:16.3245990Z D: int, 2025-05-07T20:32:16.3246201Z scale_ub: Optional[float], 2025-05-07T20:32:16.3246473Z contiguous: bool, 2025-05-07T20:32:16.3246710Z compiled: bool, 2025-05-07T20:32:16.3246925Z ) -> None: 2025-05-07T20:32:16.3247146Z torch.manual_seed(2025) 2025-05-07T20:32:16.3247382Z 2025-05-07T20:32:16.3247645Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:16.3247989Z 2025-05-07T20:32:16.3248181Z x_sign = torch.sign(x) 2025-05-07T20:32:16.3248467Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:16.3248779Z x = x_sign * x_clamp 2025-05-07T20:32:16.3249019Z x0 = x[:, :D] 2025-05-07T20:32:16.3249229Z x1 = x[:, D:] 2025-05-07T20:32:16.3249455Z 2025-05-07T20:32:16.3249653Z if contiguous: 2025-05-07T20:32:16.3249917Z x0 = x0.contiguous() 2025-05-07T20:32:16.3250175Z x1 = x1.contiguous() 2025-05-07T20:32:16.3250412Z 2025-05-07T20:32:16.3250609Z if scale_ub is not None: 2025-05-07T20:32:16.3250884Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:16.3251209Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:16.3251520Z ) 2025-05-07T20:32:16.3251720Z else: 2025-05-07T20:32:16.3251923Z scale_ub_tensor = None 2025-05-07T20:32:16.3252179Z 2025-05-07T20:32:16.3252411Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:16.3252725Z op = silu_mul_quant 2025-05-07T20:32:16.3252968Z if compiled: 2025-05-07T20:32:16.3253216Z op = torch.compile(op) 2025-05-07T20:32:16.3253513Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:16.3253858Z 2025-05-07T20:32:16.3254051Z > y_fp8, y_scale = fn() 2025-05-07T20:32:16.3254214Z 2025-05-07T20:32:16.3254317Z moe/activation_test.py:117: 2025-05-07T20:32:16.3254604Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:16.3254992Z moe/activation_test.py:115: in fn 2025-05-07T20:32:16.3255273Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:16.3255823Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/torch/_dynamo/eval_frame.py:678: in _fn 2025-05-07T20:32:16.3256380Z return fn(*args, **kwargs) 
2025-05-07T20:32:16.3257032Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:32:16.3257759Z _fbgemm_silu_mul_quant[grid]( 2025-05-07T20:32:16.3258289Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:16.3258962Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:16.3259622Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:16.3260155Z kernel = self.compile( 2025-05-07T20:32:16.3260686Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:16.3261335Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:16.3261813Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:16.3262039Z 2025-05-07T20:32:16.3262246Z self = 2025-05-07T20:32:16.3263314Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:16.3264674Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7fcda4ef3c40>} 2025-05-07T20:32:16.3266010Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:16.3267026Z context = 2025-05-07T20:32:16.3267311Z 2025-05-07T20:32:16.3267475Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:16.3267987Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:16.3268453Z module_map=module_map) 2025-05-07T20:32:16.3268818Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:16.3269166Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:32:16.3269448Z E ^ 2025-05-07T20:32:16.3269940Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:16.3270386Z 2025-05-07T20:32:16.3270795Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:16.4653301Z 2025-05-07T20:32:16.4653522Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:16.4654070Z self=, 2025-05-07T20:32:16.4654622Z T=4096, 2025-05-07T20:32:16.4654881Z D=5120, 2025-05-07T20:32:16.4655084Z scale_ub=1200.0, 2025-05-07T20:32:16.4655308Z contiguous=False, 2025-05-07T20:32:16.4655544Z compiled=False, 2025-05-07T20:32:16.4655760Z ) 2025-05-07T20:32:16.4656074Z self = 2025-05-07T20:32:16.4656573Z T = 4096, D = 5120, scale_ub = 1200.0, contiguous = False, compiled = False 2025-05-07T20:32:16.4656852Z 2025-05-07T20:32:16.4656930Z @given( 2025-05-07T20:32:16.4657162Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:16.4657473Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:16.4657996Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:16.4658324Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:16.4658646Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:16.4658930Z ) 2025-05-07T20:32:16.4659286Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:16.4659773Z def test_silu_mul_quant( 2025-05-07T20:32:16.4660015Z self, 2025-05-07T20:32:16.4660312Z T: int, 2025-05-07T20:32:16.4660506Z D: int, 2025-05-07T20:32:16.4660729Z scale_ub: Optional[float], 2025-05-07T20:32:16.4661000Z contiguous: bool, 2025-05-07T20:32:16.4661245Z compiled: bool, 2025-05-07T20:32:16.4661470Z ) -> None: 2025-05-07T20:32:16.4661688Z torch.manual_seed(2025) 2025-05-07T20:32:16.4661932Z 2025-05-07T20:32:16.4662202Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:16.4662546Z 2025-05-07T20:32:16.4662740Z x_sign = torch.sign(x) 2025-05-07T20:32:16.4663032Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:16.4663342Z x = x_sign * x_clamp 2025-05-07T20:32:16.4663589Z x0 = x[:, :D] 2025-05-07T20:32:16.4663952Z x1 = x[:, D:] 2025-05-07T20:32:16.4664164Z 2025-05-07T20:32:16.4664353Z if contiguous: 2025-05-07T20:32:16.4664581Z x0 = x0.contiguous() 2025-05-07T20:32:16.4664847Z x1 = x1.contiguous() 2025-05-07T20:32:16.4665090Z 2025-05-07T20:32:16.4665278Z if scale_ub is not None: 2025-05-07T20:32:16.4665554Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:16.4665892Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:16.4666200Z ) 2025-05-07T20:32:16.4666399Z else: 2025-05-07T20:32:16.4666613Z scale_ub_tensor = None 2025-05-07T20:32:16.4666870Z 2025-05-07T20:32:16.4667103Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:16.4667420Z op = silu_mul_quant 2025-05-07T20:32:16.4667672Z if compiled: 2025-05-07T20:32:16.4667917Z op = torch.compile(op) 2025-05-07T20:32:16.4668214Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:16.4668502Z 2025-05-07T20:32:16.4668692Z > y_fp8, y_scale = fn() 2025-05-07T20:32:16.4668861Z 2025-05-07T20:32:16.4668961Z moe/activation_test.py:117: 2025-05-07T20:32:16.4669261Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:16.4669593Z moe/activation_test.py:115: in fn 2025-05-07T20:32:16.4669920Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:16.4670618Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 
2025-05-07T20:32:16.4671306Z _fbgemm_silu_mul_quant[grid]( 2025-05-07T20:32:16.4671834Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:16.4672515Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:16.4673177Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:16.4673711Z kernel = self.compile( 2025-05-07T20:32:16.4674246Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:16.4674906Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:16.4675307Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:16.4675537Z 2025-05-07T20:32:16.4675744Z self = 2025-05-07T20:32:16.4676813Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:16.4678239Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7fcda52582c0>} 2025-05-07T20:32:16.4679564Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:16.4680620Z context = 2025-05-07T20:32:16.4680902Z 2025-05-07T20:32:16.4681068Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:16.4681585Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:16.4682052Z module_map=module_map) 2025-05-07T20:32:16.4682417Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:16.4682769Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:32:16.4683034Z E ^ 2025-05-07T20:32:16.4683574Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:16.4684018Z 2025-05-07T20:32:16.4684430Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:16.4684942Z 2025-05-07T20:32:16.4685050Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:16.4685465Z self=, 2025-05-07T20:32:16.4685866Z T=4096, 2025-05-07T20:32:16.4686053Z D=5120, 2025-05-07T20:32:16.4686254Z scale_ub=1200.0, 2025-05-07T20:32:16.4686482Z contiguous=False, 2025-05-07T20:32:16.4686704Z compiled=True, 2025-05-07T20:32:16.4686908Z ) 2025-05-07T20:32:16.4687232Z self = 2025-05-07T20:32:16.4687719Z T = 4096, D = 5120, scale_ub = 1200.0, contiguous = False, compiled = True 2025-05-07T20:32:16.4687999Z 2025-05-07T20:32:16.4688077Z @given( 2025-05-07T20:32:16.4688309Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:16.4688624Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:16.4688936Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:16.4689266Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:16.4689597Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:16.4689932Z ) 2025-05-07T20:32:16.4690283Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:16.4690728Z def test_silu_mul_quant( 2025-05-07T20:32:16.4690973Z self, 2025-05-07T20:32:16.4691167Z T: int, 2025-05-07T20:32:16.4691367Z D: int, 2025-05-07T20:32:16.4691581Z scale_ub: Optional[float], 2025-05-07T20:32:16.4691857Z contiguous: bool, 2025-05-07T20:32:16.4692102Z compiled: bool, 2025-05-07T20:32:16.4692322Z ) -> None: 2025-05-07T20:32:16.4692546Z torch.manual_seed(2025) 2025-05-07T20:32:16.4692789Z 2025-05-07T20:32:16.4693061Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:16.4693405Z 2025-05-07T20:32:16.4693600Z x_sign = torch.sign(x) 2025-05-07T20:32:16.4693991Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:16.4694308Z x = x_sign * x_clamp 2025-05-07T20:32:16.4694548Z x0 = x[:, :D] 2025-05-07T20:32:16.4694766Z x1 = x[:, D:] 2025-05-07T20:32:16.4694971Z 2025-05-07T20:32:16.4695156Z if contiguous: 2025-05-07T20:32:16.4695386Z x0 = x0.contiguous() 2025-05-07T20:32:16.4695637Z x1 = x1.contiguous() 2025-05-07T20:32:16.4695877Z 2025-05-07T20:32:16.4696072Z if scale_ub is not None: 2025-05-07T20:32:16.4696393Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:16.4696729Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:16.4697037Z ) 2025-05-07T20:32:16.4697227Z else: 2025-05-07T20:32:16.4697436Z scale_ub_tensor = None 2025-05-07T20:32:16.4697692Z 2025-05-07T20:32:16.4697937Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:16.4706204Z op = silu_mul_quant 2025-05-07T20:32:16.4706471Z if compiled: 2025-05-07T20:32:16.4706714Z op = torch.compile(op) 2025-05-07T20:32:16.4707142Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:16.4707420Z 2025-05-07T20:32:16.4707605Z > y_fp8, y_scale = fn() 2025-05-07T20:32:16.4707776Z 2025-05-07T20:32:16.4707871Z moe/activation_test.py:117: 2025-05-07T20:32:16.4708167Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:16.4708487Z moe/activation_test.py:115: in fn 2025-05-07T20:32:16.4708772Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:16.4709331Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/torch/_dynamo/eval_frame.py:678: in _fn 2025-05-07T20:32:16.4709889Z return fn(*args, **kwargs) 
2025-05-07T20:32:16.4710680Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:32:16.4711365Z _fbgemm_silu_mul_quant[grid]( 2025-05-07T20:32:16.4711897Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:16.4712567Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:16.4713228Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:16.4713755Z kernel = self.compile( 2025-05-07T20:32:16.4714295Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:16.4714939Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:16.4715339Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:16.4715577Z 2025-05-07T20:32:16.4715788Z self = 2025-05-07T20:32:16.4716856Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:16.4718208Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7fcda5259b20>} 2025-05-07T20:32:16.4719538Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:16.4720551Z context = 2025-05-07T20:32:16.4720835Z 2025-05-07T20:32:16.4721012Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:16.4721525Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:16.4721980Z module_map=module_map) 2025-05-07T20:32:16.4722345Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:16.4722697Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:32:16.4722947Z E ^ 2025-05-07T20:32:16.4723399Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:16.4723839Z 2025-05-07T20:32:16.4724254Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:16.4724835Z 2025-05-07T20:32:16.4724946Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:16.4725343Z self=, 2025-05-07T20:32:16.4725741Z T=2048, 2025-05-07T20:32:16.4725931Z D=7168, 2025-05-07T20:32:16.4726121Z scale_ub=1200.0, 2025-05-07T20:32:16.4726345Z contiguous=False, 2025-05-07T20:32:16.4726569Z compiled=False, 2025-05-07T20:32:16.6687422Z ) 2025-05-07T20:32:16.6688012Z self = 2025-05-07T20:32:16.6688971Z T = 2048, D = 7168, scale_ub = 1200.0, contiguous = False, compiled = False 2025-05-07T20:32:16.6689259Z 2025-05-07T20:32:16.6689342Z @given( 2025-05-07T20:32:16.6689598Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:16.6689949Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:16.6690262Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:16.6690606Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:16.6690935Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:16.6691229Z ) 2025-05-07T20:32:16.6691585Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:16.6692201Z def test_silu_mul_quant( 2025-05-07T20:32:16.6692446Z self, 2025-05-07T20:32:16.6692648Z T: int, 2025-05-07T20:32:16.6692849Z D: int, 2025-05-07T20:32:16.6693065Z scale_ub: Optional[float], 2025-05-07T20:32:16.6693347Z contiguous: bool, 2025-05-07T20:32:16.6693591Z compiled: bool, 2025-05-07T20:32:16.6693956Z ) -> None: 2025-05-07T20:32:16.6694177Z torch.manual_seed(2025) 2025-05-07T20:32:16.6694422Z 2025-05-07T20:32:16.6694695Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:16.6695046Z 2025-05-07T20:32:16.6695246Z x_sign = torch.sign(x) 2025-05-07T20:32:16.6695536Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:16.6695849Z x = x_sign * x_clamp 2025-05-07T20:32:16.6696095Z x0 = x[:, :D] 2025-05-07T20:32:16.6696313Z x1 = x[:, D:] 2025-05-07T20:32:16.6696527Z 2025-05-07T20:32:16.6696716Z if contiguous: 2025-05-07T20:32:16.6696955Z x0 = x0.contiguous() 2025-05-07T20:32:16.6697216Z x1 = x1.contiguous() 2025-05-07T20:32:16.6697458Z 2025-05-07T20:32:16.6697656Z if scale_ub is not None: 2025-05-07T20:32:16.6697934Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:16.6698539Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:16.6698853Z ) 2025-05-07T20:32:16.6699051Z else: 2025-05-07T20:32:16.6699270Z scale_ub_tensor = None 2025-05-07T20:32:16.6699532Z 2025-05-07T20:32:16.6699760Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:16.6700073Z op = silu_mul_quant 2025-05-07T20:32:16.6700324Z if compiled: 2025-05-07T20:32:16.6700568Z op = torch.compile(op) 2025-05-07T20:32:16.6700864Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:16.6701143Z 2025-05-07T20:32:16.6701327Z > y_fp8, y_scale = fn() 2025-05-07T20:32:16.6701495Z 2025-05-07T20:32:16.6701599Z moe/activation_test.py:117: 2025-05-07T20:32:16.6701895Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:16.6702226Z moe/activation_test.py:115: in fn 2025-05-07T20:32:16.6702504Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:16.6703187Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 
2025-05-07T20:32:16.6703867Z _fbgemm_silu_mul_quant[grid]( 2025-05-07T20:32:16.6704392Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:16.6705066Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:16.6705817Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:16.6706347Z kernel = self.compile( 2025-05-07T20:32:16.6706883Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:16.6707531Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:16.6707928Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:16.6708216Z 2025-05-07T20:32:16.6708429Z self = 2025-05-07T20:32:16.6709491Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:16.6710862Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7fcda525a700>} 2025-05-07T20:32:16.6712299Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:16.6713305Z context = 2025-05-07T20:32:16.6713593Z 2025-05-07T20:32:16.6713756Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:16.6714269Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:16.6714729Z module_map=module_map) 2025-05-07T20:32:16.6715093Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:16.6715438Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:32:16.6715701Z E ^ 2025-05-07T20:32:16.6716158Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:16.6716595Z 2025-05-07T20:32:16.6717010Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:16.6717523Z 2025-05-07T20:32:16.6717624Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:16.6718030Z self=, 2025-05-07T20:32:16.6718431Z T=1, 2025-05-07T20:32:16.6718610Z D=7168, 2025-05-07T20:32:16.6718804Z scale_ub=None, 2025-05-07T20:32:16.6719019Z contiguous=True, 2025-05-07T20:32:16.6719239Z compiled=False, 2025-05-07T20:32:16.6719441Z ) 2025-05-07T20:32:16.6719757Z self = 2025-05-07T20:32:16.6720232Z T = 1, D = 7168, scale_ub = None, contiguous = True, compiled = False 2025-05-07T20:32:16.6720498Z 2025-05-07T20:32:16.6720576Z @given( 2025-05-07T20:32:16.6720810Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:16.6721128Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:16.6721433Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:16.6721760Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:16.6722086Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:16.6722364Z ) 2025-05-07T20:32:16.6722717Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:16.6723164Z def test_silu_mul_quant( 2025-05-07T20:32:16.6723401Z self, 2025-05-07T20:32:16.6723600Z T: int, 2025-05-07T20:32:16.6723797Z D: int, 2025-05-07T20:32:16.6724008Z scale_ub: Optional[float], 2025-05-07T20:32:16.6724281Z contiguous: bool, 2025-05-07T20:32:16.6724521Z compiled: bool, 2025-05-07T20:32:16.6724793Z ) -> None: 2025-05-07T20:32:16.6725010Z torch.manual_seed(2025) 2025-05-07T20:32:16.6725248Z 2025-05-07T20:32:16.6725520Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:16.6725859Z 2025-05-07T20:32:16.6726055Z x_sign = torch.sign(x) 2025-05-07T20:32:16.6726351Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:16.6726657Z x = x_sign * x_clamp 2025-05-07T20:32:16.6726895Z x0 = x[:, :D] 2025-05-07T20:32:16.6727111Z x1 = x[:, D:] 2025-05-07T20:32:16.6727361Z 2025-05-07T20:32:16.6727544Z if contiguous: 2025-05-07T20:32:16.6727774Z x0 = x0.contiguous() 2025-05-07T20:32:16.6728021Z x1 = x1.contiguous() 2025-05-07T20:32:16.6728264Z 2025-05-07T20:32:16.6728454Z if scale_ub is not None: 2025-05-07T20:32:16.6728718Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:16.6729050Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:16.6729360Z ) 2025-05-07T20:32:16.6729548Z else: 2025-05-07T20:32:16.6729756Z scale_ub_tensor = None 2025-05-07T20:32:16.6730005Z 2025-05-07T20:32:16.6730228Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:16.6730619Z op = silu_mul_quant 2025-05-07T20:32:16.6730865Z if compiled: 2025-05-07T20:32:16.6731114Z op = torch.compile(op) 2025-05-07T20:32:16.6731403Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:16.6731681Z 2025-05-07T20:32:16.6731867Z > y_fp8, y_scale = fn() 2025-05-07T20:32:16.6732031Z 2025-05-07T20:32:16.6732129Z moe/activation_test.py:117: 2025-05-07T20:32:16.6732421Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:16.6732748Z moe/activation_test.py:115: in fn 2025-05-07T20:32:16.6733024Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:16.6733861Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:32:16.6734546Z 
_fbgemm_silu_mul_quant[grid]( 2025-05-07T20:32:16.6735078Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:16.6735747Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:16.6736405Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:16.6736936Z kernel = self.compile( 2025-05-07T20:32:16.6737466Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:16.6738117Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:16.6738515Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:16.6738739Z 2025-05-07T20:32:16.6738944Z self = 2025-05-07T20:32:16.6740028Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:16.6741376Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7fcda525ba60>} 2025-05-07T20:32:16.6742705Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:16.6743704Z context = 2025-05-07T20:32:16.6743992Z 2025-05-07T20:32:16.6744156Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:16.6744722Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:16.6745185Z module_map=module_map) 2025-05-07T20:32:16.6745542Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:16.6745897Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:32:16.6746157Z E ^ 2025-05-07T20:32:16.6746610Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:16.6747101Z 2025-05-07T20:32:16.6747508Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:16.6748015Z 2025-05-07T20:32:16.6748118Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:16.6748529Z self=, 2025-05-07T20:32:16.6748921Z T=16384, 2025-05-07T20:32:16.6749114Z D=7168, 2025-05-07T20:32:16.6749311Z scale_ub=1200.0, 2025-05-07T20:32:16.6749530Z contiguous=False, 2025-05-07T20:32:16.6749752Z compiled=True, 2025-05-07T20:32:16.6749951Z ) 2025-05-07T20:32:16.6750258Z self = 2025-05-07T20:32:16.6750857Z T = 16384, D = 7168, scale_ub = 1200.0, contiguous = False, compiled = True 2025-05-07T20:32:16.6751135Z 2025-05-07T20:32:16.6751212Z @given( 2025-05-07T20:32:16.6751440Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:16.6751749Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:16.6752053Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:16.6752378Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:16.6752697Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:16.6752981Z ) 2025-05-07T20:32:16.6753328Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:16.6753760Z def test_silu_mul_quant( 2025-05-07T20:32:16.6754006Z self, 2025-05-07T20:32:16.6754198Z T: int, 2025-05-07T20:32:16.6754389Z D: int, 2025-05-07T20:32:16.6754610Z scale_ub: Optional[float], 2025-05-07T20:32:16.6754877Z contiguous: bool, 2025-05-07T20:32:16.6755122Z compiled: bool, 2025-05-07T20:32:16.6755357Z ) -> None: 2025-05-07T20:32:16.6755572Z torch.manual_seed(2025) 2025-05-07T20:32:16.6755813Z 2025-05-07T20:32:16.6756073Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:16.6756418Z 2025-05-07T20:32:16.6756609Z x_sign = torch.sign(x) 2025-05-07T20:32:16.6756891Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:16.6757198Z x = x_sign * x_clamp 2025-05-07T20:32:16.6757437Z x0 = x[:, :D] 2025-05-07T20:32:16.6757658Z x1 = x[:, D:] 2025-05-07T20:32:16.6757857Z 2025-05-07T20:32:16.6758042Z if contiguous: 2025-05-07T20:32:16.6758272Z x0 = x0.contiguous() 2025-05-07T20:32:16.6758525Z x1 = x1.contiguous() 2025-05-07T20:32:16.6758766Z 2025-05-07T20:32:16.6758958Z if scale_ub is not None: 2025-05-07T20:32:16.6759228Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:16.6759566Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:16.6759869Z ) 2025-05-07T20:32:16.6760055Z else: 2025-05-07T20:32:16.6760268Z scale_ub_tensor = None 2025-05-07T20:32:16.6760520Z 2025-05-07T20:32:16.6760745Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:16.6761056Z op = silu_mul_quant 2025-05-07T20:32:16.6761304Z if compiled: 2025-05-07T20:32:16.6761543Z op = torch.compile(op) 2025-05-07T20:32:16.6761839Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:16.6762111Z 2025-05-07T20:32:16.6762296Z > y_fp8, y_scale = fn() 2025-05-07T20:32:16.6762463Z 2025-05-07T20:32:16.6762614Z moe/activation_test.py:117: 2025-05-07T20:32:16.6762907Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:16.6763231Z moe/activation_test.py:115: in fn 2025-05-07T20:32:16.6763506Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:16.6764064Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/torch/_dynamo/eval_frame.py:678: in _fn 2025-05-07T20:32:16.6764615Z return fn(*args, **kwargs) 
2025-05-07T20:32:16.6765258Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:32:16.6765974Z _fbgemm_silu_mul_quant[grid]( 2025-05-07T20:32:16.6766503Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:16.6767173Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:16.6767822Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:16.6768352Z kernel = self.compile( 2025-05-07T20:32:16.6768888Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:16.6769611Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:16.6770002Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:16.6770233Z 2025-05-07T20:32:16.6770438Z self = 2025-05-07T20:32:16.6771501Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:16.6772845Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7fcda48e4d60>} 2025-05-07T20:32:16.6774225Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:16.6775238Z context = 2025-05-07T20:32:16.6775531Z 2025-05-07T20:32:16.6775694Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:16.6776206Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:16.6776658Z module_map=module_map) 2025-05-07T20:32:16.6777017Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:16.6777368Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:32:16.6777619Z E ^ 2025-05-07T20:32:16.6778074Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:16.6778519Z 2025-05-07T20:32:16.6778927Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:16.8091776Z 2025-05-07T20:32:16.8092570Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:16.8093211Z self=, 2025-05-07T20:32:16.8093840Z T=1, 2025-05-07T20:32:16.8094095Z D=7168, 2025-05-07T20:32:16.8094366Z scale_ub=None, 2025-05-07T20:32:16.8094640Z contiguous=False, 2025-05-07T20:32:16.8094931Z compiled=False, 2025-05-07T20:32:16.8095194Z ) 2025-05-07T20:32:16.8095639Z self = 2025-05-07T20:32:16.8096182Z T = 1, D = 7168, scale_ub = None, contiguous = False, compiled = False 2025-05-07T20:32:16.8096440Z 2025-05-07T20:32:16.8096520Z @given( 2025-05-07T20:32:16.8096753Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:16.8097373Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:16.8097671Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:16.8098000Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:16.8098671Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:16.8098963Z ) 2025-05-07T20:32:16.8099303Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:16.8099742Z def test_silu_mul_quant( 2025-05-07T20:32:16.8100094Z self, 2025-05-07T20:32:16.8100282Z T: int, 2025-05-07T20:32:16.8100478Z D: int, 2025-05-07T20:32:16.8100693Z scale_ub: Optional[float], 2025-05-07T20:32:16.8100959Z contiguous: bool, 2025-05-07T20:32:16.8101197Z compiled: bool, 2025-05-07T20:32:16.8101426Z ) -> None: 2025-05-07T20:32:16.8101635Z torch.manual_seed(2025) 2025-05-07T20:32:16.8101874Z 2025-05-07T20:32:16.8102149Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:16.8102486Z 2025-05-07T20:32:16.8102681Z x_sign = torch.sign(x) 2025-05-07T20:32:16.8102971Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:16.8103453Z x = x_sign * x_clamp 2025-05-07T20:32:16.8103701Z x0 = x[:, :D] 2025-05-07T20:32:16.8103917Z x1 = x[:, D:] 2025-05-07T20:32:16.8104121Z 2025-05-07T20:32:16.8104306Z if contiguous: 2025-05-07T20:32:16.8104536Z x0 = x0.contiguous() 2025-05-07T20:32:16.8104800Z x1 = x1.contiguous() 2025-05-07T20:32:16.8105031Z 2025-05-07T20:32:16.8105226Z if scale_ub is not None: 2025-05-07T20:32:16.8105498Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:16.8105828Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:16.8106141Z ) 2025-05-07T20:32:16.8106332Z else: 2025-05-07T20:32:16.8106539Z scale_ub_tensor = None 2025-05-07T20:32:16.8106794Z 2025-05-07T20:32:16.8107027Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:16.8107333Z op = silu_mul_quant 2025-05-07T20:32:16.8107582Z if compiled: 2025-05-07T20:32:16.8107836Z op = torch.compile(op) 2025-05-07T20:32:16.8108129Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:16.8108408Z 2025-05-07T20:32:16.8108602Z > y_fp8, y_scale = fn() 2025-05-07T20:32:16.8108763Z 2025-05-07T20:32:16.8108868Z moe/activation_test.py:117: 2025-05-07T20:32:16.8109161Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:16.8109491Z moe/activation_test.py:115: in fn 2025-05-07T20:32:16.8109807Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:16.8110502Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:32:16.8111184Z 
_fbgemm_silu_mul_quant[grid]( 2025-05-07T20:32:16.8111728Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:16.8112403Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:16.8113063Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:16.8113596Z kernel = self.compile( 2025-05-07T20:32:16.8114135Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:16.8114783Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:16.8115180Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:16.8115410Z 2025-05-07T20:32:16.8115616Z self = 2025-05-07T20:32:16.8116675Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:16.8118627Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7fcda48e5760>} 2025-05-07T20:32:16.8120000Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:16.8121055Z context = 2025-05-07T20:32:16.8121340Z 2025-05-07T20:32:16.8121509Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:16.8122020Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:16.8122485Z module_map=module_map) 2025-05-07T20:32:16.8122848Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:16.8123199Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:32:16.8123453Z E ^ 2025-05-07T20:32:16.8123996Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:16.8124437Z 2025-05-07T20:32:16.8124853Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:16.8125361Z 2025-05-07T20:32:16.8125474Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:16.8125875Z self=, 2025-05-07T20:32:16.8126274Z T=2048, 2025-05-07T20:32:16.8126463Z D=7168, 2025-05-07T20:32:16.8126652Z scale_ub=None, 2025-05-07T20:32:16.8126872Z contiguous=False, 2025-05-07T20:32:16.8127099Z compiled=True, 2025-05-07T20:32:16.8127302Z ) 2025-05-07T20:32:16.8127617Z self = 2025-05-07T20:32:16.8128106Z T = 2048, D = 7168, scale_ub = None, contiguous = False, compiled = True 2025-05-07T20:32:16.8128371Z 2025-05-07T20:32:16.8128458Z @given( 2025-05-07T20:32:16.8128686Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:16.8129016Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:16.8129433Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:16.8129895Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:16.8130369Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:16.8130722Z ) 2025-05-07T20:32:16.8139440Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:16.8139940Z def test_silu_mul_quant( 2025-05-07T20:32:16.8140171Z self, 2025-05-07T20:32:16.8140367Z T: int, 2025-05-07T20:32:16.8140566Z D: int, 2025-05-07T20:32:16.8140776Z scale_ub: Optional[float], 2025-05-07T20:32:16.8141042Z contiguous: bool, 2025-05-07T20:32:16.8141275Z compiled: bool, 2025-05-07T20:32:16.8141490Z ) -> None: 2025-05-07T20:32:16.8141706Z torch.manual_seed(2025) 2025-05-07T20:32:16.8141953Z 2025-05-07T20:32:16.8142216Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:16.8142560Z 2025-05-07T20:32:16.8142753Z x_sign = torch.sign(x) 2025-05-07T20:32:16.8143037Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:16.8143347Z x = x_sign * x_clamp 2025-05-07T20:32:16.8143585Z x0 = x[:, :D] 2025-05-07T20:32:16.8143799Z x1 = x[:, D:] 2025-05-07T20:32:16.8143999Z 2025-05-07T20:32:16.8144182Z if contiguous: 2025-05-07T20:32:16.8144409Z x0 = x0.contiguous() 2025-05-07T20:32:16.8144654Z x1 = x1.contiguous() 2025-05-07T20:32:16.8144893Z 2025-05-07T20:32:16.8145162Z if scale_ub is not None: 2025-05-07T20:32:16.8145428Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:16.8145761Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:16.8146067Z ) 2025-05-07T20:32:16.8146252Z else: 2025-05-07T20:32:16.8146469Z scale_ub_tensor = None 2025-05-07T20:32:16.8146720Z 2025-05-07T20:32:16.8146942Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:16.8147256Z op = silu_mul_quant 2025-05-07T20:32:16.8147555Z if compiled: 2025-05-07T20:32:16.8147793Z op = torch.compile(op) 2025-05-07T20:32:16.8148088Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:16.8148363Z 2025-05-07T20:32:16.8148555Z > y_fp8, y_scale = fn() 2025-05-07T20:32:16.8148714Z 2025-05-07T20:32:16.8148810Z moe/activation_test.py:117: 2025-05-07T20:32:16.8149098Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:16.8149421Z moe/activation_test.py:115: in fn 2025-05-07T20:32:16.8149690Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:16.8150236Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/torch/_dynamo/eval_frame.py:678: in _fn 2025-05-07T20:32:16.8150890Z return fn(*args, **kwargs) 
2025-05-07T20:32:16.8151538Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:32:16.8152215Z _fbgemm_silu_mul_quant[grid]( 2025-05-07T20:32:16.8152753Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:16.8153416Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:16.8154070Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:16.8154594Z kernel = self.compile( 2025-05-07T20:32:16.8155125Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:16.8155767Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:16.8156165Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:16.8156391Z 2025-05-07T20:32:16.8156600Z self = 2025-05-07T20:32:16.8157654Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:16.8159003Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7fcda48e6f20>} 2025-05-07T20:32:16.8160322Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:16.8161325Z context = 2025-05-07T20:32:16.8161604Z 2025-05-07T20:32:16.8161780Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:16.8162283Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:16.8162743Z module_map=module_map) 2025-05-07T20:32:16.8163101Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:16.8163443Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:32:16.8163703Z E ^ 2025-05-07T20:32:16.8164154Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:16.8164591Z 2025-05-07T20:32:16.8165003Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:16.8165556Z 2025-05-07T20:32:16.8165657Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:16.8166059Z self=, 2025-05-07T20:32:16.8166461Z T=4096, 2025-05-07T20:32:16.8166644Z D=7168, 2025-05-07T20:32:16.8166841Z scale_ub=None, 2025-05-07T20:32:16.8167060Z contiguous=False, 2025-05-07T20:32:16.8167282Z compiled=True, 2025-05-07T20:32:17.0414745Z ) 2025-05-07T20:32:17.0415275Z self = 2025-05-07T20:32:17.0416023Z T = 4096, D = 7168, scale_ub = None, contiguous = False, compiled = True 2025-05-07T20:32:17.0416398Z 2025-05-07T20:32:17.0416507Z @given( 2025-05-07T20:32:17.0416760Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:17.0417072Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:17.0417405Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:17.0417743Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:17.0418080Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:17.0418360Z ) 2025-05-07T20:32:17.0419083Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:17.0419531Z def test_silu_mul_quant( 2025-05-07T20:32:17.0419768Z self, 2025-05-07T20:32:17.0419966Z T: int, 2025-05-07T20:32:17.0420165Z D: int, 2025-05-07T20:32:17.0420385Z scale_ub: Optional[float], 2025-05-07T20:32:17.0420658Z contiguous: bool, 2025-05-07T20:32:17.0420896Z compiled: bool, 2025-05-07T20:32:17.0421119Z ) -> None: 2025-05-07T20:32:17.0421340Z torch.manual_seed(2025) 2025-05-07T20:32:17.0421586Z 2025-05-07T20:32:17.0421851Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:17.0422191Z 2025-05-07T20:32:17.0422390Z x_sign = torch.sign(x) 2025-05-07T20:32:17.0422675Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:17.0422989Z x = x_sign * x_clamp 2025-05-07T20:32:17.0423233Z x0 = x[:, :D] 2025-05-07T20:32:17.0423453Z x1 = x[:, D:] 2025-05-07T20:32:17.0423661Z 2025-05-07T20:32:17.0423849Z if contiguous: 2025-05-07T20:32:17.0424083Z x0 = x0.contiguous() 2025-05-07T20:32:17.0424340Z x1 = x1.contiguous() 2025-05-07T20:32:17.0424586Z 2025-05-07T20:32:17.0424782Z if scale_ub is not None: 2025-05-07T20:32:17.0425050Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:17.0425385Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:17.0425703Z ) 2025-05-07T20:32:17.0425894Z else: 2025-05-07T20:32:17.0426110Z scale_ub_tensor = None 2025-05-07T20:32:17.0426362Z 2025-05-07T20:32:17.0426590Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:17.0426908Z op = silu_mul_quant 2025-05-07T20:32:17.0427158Z if compiled: 2025-05-07T20:32:17.0427401Z op = torch.compile(op) 2025-05-07T20:32:17.0427700Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:17.0427980Z 2025-05-07T20:32:17.0428173Z > y_fp8, y_scale = fn() 2025-05-07T20:32:17.0428341Z 2025-05-07T20:32:17.0428440Z moe/activation_test.py:117: 2025-05-07T20:32:17.0428738Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:17.0429071Z moe/activation_test.py:115: in fn 2025-05-07T20:32:17.0429351Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:17.0429951Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/torch/_dynamo/eval_frame.py:678: in _fn 2025-05-07T20:32:17.0430527Z return fn(*args, **kwargs) 
2025-05-07T20:32:17.0431176Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:32:17.0431954Z _fbgemm_silu_mul_quant[grid]( 2025-05-07T20:32:17.0432488Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:17.0433168Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:17.0433827Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:17.0434356Z kernel = self.compile( 2025-05-07T20:32:17.0434988Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:17.0435642Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:17.0436034Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:17.0436267Z 2025-05-07T20:32:17.0436475Z self = 2025-05-07T20:32:17.0437545Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:17.0439001Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7fcda4df00e0>} 2025-05-07T20:32:17.0440377Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:17.0441391Z context = 2025-05-07T20:32:17.0441682Z 2025-05-07T20:32:17.0441849Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:17.0442366Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:17.0442826Z module_map=module_map) 2025-05-07T20:32:17.0443196Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:17.0443549Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:32:17.0443810Z E ^ 2025-05-07T20:32:17.0444276Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:17.0444722Z 2025-05-07T20:32:17.0445132Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:17.0445641Z 2025-05-07T20:32:17.0445753Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:17.0446159Z self=, 2025-05-07T20:32:17.0446558Z T=16384, 2025-05-07T20:32:17.0446755Z D=5120, 2025-05-07T20:32:17.0446950Z scale_ub=1200.0, 2025-05-07T20:32:17.0447180Z contiguous=False, 2025-05-07T20:32:17.0447411Z compiled=False, 2025-05-07T20:32:17.0447625Z ) 2025-05-07T20:32:17.0447938Z self = 2025-05-07T20:32:17.0448435Z T = 16384, D = 5120, scale_ub = 1200.0, contiguous = False, compiled = False 2025-05-07T20:32:17.0448718Z 2025-05-07T20:32:17.0448802Z @given( 2025-05-07T20:32:17.0449030Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:17.0449344Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:17.0449655Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:17.0449977Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:17.0450306Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:17.0450591Z ) 2025-05-07T20:32:17.0450930Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:17.0451372Z def test_silu_mul_quant( 2025-05-07T20:32:17.0451616Z self, 2025-05-07T20:32:17.0451881Z T: int, 2025-05-07T20:32:17.0452081Z D: int, 2025-05-07T20:32:17.0452297Z scale_ub: Optional[float], 2025-05-07T20:32:17.0452569Z contiguous: bool, 2025-05-07T20:32:17.0452811Z compiled: bool, 2025-05-07T20:32:17.0453031Z ) -> None: 2025-05-07T20:32:17.0453259Z torch.manual_seed(2025) 2025-05-07T20:32:17.0453509Z 2025-05-07T20:32:17.0453908Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:17.0454253Z 2025-05-07T20:32:17.0454502Z x_sign = torch.sign(x) 2025-05-07T20:32:17.0454789Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:17.0455102Z x = x_sign * x_clamp 2025-05-07T20:32:17.0455344Z x0 = x[:, :D] 2025-05-07T20:32:17.0455558Z x1 = x[:, D:] 2025-05-07T20:32:17.0455773Z 2025-05-07T20:32:17.0455961Z if contiguous: 2025-05-07T20:32:17.0456188Z x0 = x0.contiguous() 2025-05-07T20:32:17.0456453Z x1 = x1.contiguous() 2025-05-07T20:32:17.0456697Z 2025-05-07T20:32:17.0456879Z if scale_ub is not None: 2025-05-07T20:32:17.0457153Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:17.0457486Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:17.0457881Z ) 2025-05-07T20:32:17.0458075Z else: 2025-05-07T20:32:17.0458286Z scale_ub_tensor = None 2025-05-07T20:32:17.0458539Z 2025-05-07T20:32:17.0458764Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:17.0459083Z op = silu_mul_quant 2025-05-07T20:32:17.0459332Z if compiled: 2025-05-07T20:32:17.0459589Z op = torch.compile(op) 2025-05-07T20:32:17.0459925Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:17.0460201Z 2025-05-07T20:32:17.0460392Z > y_fp8, y_scale = fn() 2025-05-07T20:32:17.0460562Z 2025-05-07T20:32:17.0460664Z moe/activation_test.py:117: 2025-05-07T20:32:17.0460961Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:17.0461291Z moe/activation_test.py:115: in fn 2025-05-07T20:32:17.0461571Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:17.0462256Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 
2025-05-07T20:32:17.0462939Z _fbgemm_silu_mul_quant[grid]( 2025-05-07T20:32:17.0463467Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:17.0464145Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:17.0464806Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:17.0465346Z kernel = self.compile( 2025-05-07T20:32:17.0465880Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:17.0466536Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:17.0466934Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:17.0467161Z 2025-05-07T20:32:17.0467373Z self = 2025-05-07T20:32:17.0468437Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:17.0469814Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7fcda4df0b80>} 2025-05-07T20:32:17.0471178Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:17.0472257Z context = 2025-05-07T20:32:17.0472542Z 2025-05-07T20:32:17.0472707Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:17.0473233Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:17.0473694Z module_map=module_map) 2025-05-07T20:32:17.0474062Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:17.0474452Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:32:17.0474711Z E ^ 2025-05-07T20:32:17.0475170Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:17.0475610Z 2025-05-07T20:32:17.0476019Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:17.0476532Z 2025-05-07T20:32:17.0476636Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:17.0477052Z self=, 2025-05-07T20:32:17.0477454Z T=16384, 2025-05-07T20:32:17.0477642Z D=5120, 2025-05-07T20:32:17.0477838Z scale_ub=1200.0, 2025-05-07T20:32:17.0478174Z contiguous=True, 2025-05-07T20:32:17.0478391Z compiled=True, 2025-05-07T20:32:17.0478595Z ) 2025-05-07T20:32:17.0478915Z self = 2025-05-07T20:32:17.0479405Z T = 16384, D = 5120, scale_ub = 1200.0, contiguous = True, compiled = True 2025-05-07T20:32:17.0479678Z 2025-05-07T20:32:17.0479754Z @given( 2025-05-07T20:32:17.0479987Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:17.0480291Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:17.0480597Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:17.0480926Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:17.0481255Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:17.0481534Z ) 2025-05-07T20:32:17.0481880Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:17.0482321Z def test_silu_mul_quant( 2025-05-07T20:32:17.0482563Z self, 2025-05-07T20:32:17.0482759Z T: int, 2025-05-07T20:32:17.0482957Z D: int, 2025-05-07T20:32:17.0483169Z scale_ub: Optional[float], 2025-05-07T20:32:17.0483441Z contiguous: bool, 2025-05-07T20:32:17.0483682Z compiled: bool, 2025-05-07T20:32:17.0483897Z ) -> None: 2025-05-07T20:32:17.0484113Z torch.manual_seed(2025) 2025-05-07T20:32:17.0484353Z 2025-05-07T20:32:17.0484617Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:17.0484960Z 2025-05-07T20:32:17.0485153Z x_sign = torch.sign(x) 2025-05-07T20:32:17.0485445Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:17.0485749Z x = x_sign * x_clamp 2025-05-07T20:32:17.0485988Z x0 = x[:, :D] 2025-05-07T20:32:17.0486206Z x1 = x[:, D:] 2025-05-07T20:32:17.0486412Z 2025-05-07T20:32:17.0486601Z if contiguous: 2025-05-07T20:32:17.0486837Z x0 = x0.contiguous() 2025-05-07T20:32:17.0487096Z x1 = x1.contiguous() 2025-05-07T20:32:17.0487343Z 2025-05-07T20:32:17.0487539Z if scale_ub is not None: 2025-05-07T20:32:17.0487807Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:17.0488147Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:17.0488460Z ) 2025-05-07T20:32:17.0488647Z else: 2025-05-07T20:32:17.0488858Z scale_ub_tensor = None 2025-05-07T20:32:17.0489114Z 2025-05-07T20:32:17.0489342Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:17.0489661Z op = silu_mul_quant 2025-05-07T20:32:17.0489917Z if compiled: 2025-05-07T20:32:17.0490221Z op = torch.compile(op) 2025-05-07T20:32:17.0490515Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:17.0490789Z 2025-05-07T20:32:17.0490994Z > y_fp8, y_scale = fn() 2025-05-07T20:32:17.0491159Z 2025-05-07T20:32:17.0491255Z moe/activation_test.py:117: 2025-05-07T20:32:17.0491554Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:17.0491882Z moe/activation_test.py:115: in fn 2025-05-07T20:32:17.0492156Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:17.0492753Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/torch/_dynamo/eval_frame.py:678: in _fn 2025-05-07T20:32:17.0493305Z return fn(*args, **kwargs) 
2025-05-07T20:32:17.0494060Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:32:17.0494734Z _fbgemm_silu_mul_quant[grid]( 2025-05-07T20:32:17.0495272Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:17.0495944Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:17.0496676Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:17.0497209Z kernel = self.compile( 2025-05-07T20:32:17.0497749Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:17.0498706Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:17.0499098Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:17.0499336Z 2025-05-07T20:32:17.0499542Z self = 2025-05-07T20:32:17.0500611Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:17.0501970Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7fcda4df22a0>} 2025-05-07T20:32:17.0504474Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:17.0505485Z context = 2025-05-07T20:32:17.0505774Z 2025-05-07T20:32:17.0505941Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:17.0506458Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:17.0506914Z module_map=module_map) 2025-05-07T20:32:17.0507283Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:17.0507635Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:32:17.0507902Z E ^ 2025-05-07T20:32:17.0508362Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:17.0508809Z 2025-05-07T20:32:17.0509217Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:17.2062218Z 2025-05-07T20:32:17.2062823Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:17.2064025Z self=, 2025-05-07T20:32:17.2065116Z T=16384, 2025-05-07T20:32:17.2065584Z D=5120, 2025-05-07T20:32:17.2065974Z scale_ub=None, 2025-05-07T20:32:17.2066447Z contiguous=False, 2025-05-07T20:32:17.2066889Z compiled=True, 2025-05-07T20:32:17.2067301Z ) 2025-05-07T20:32:17.2068300Z self = 2025-05-07T20:32:17.2069290Z T = 16384, D = 5120, scale_ub = None, contiguous = False, compiled = True 2025-05-07T20:32:17.2069719Z 2025-05-07T20:32:17.2069798Z @given( 2025-05-07T20:32:17.2070046Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:17.2070361Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:17.2070663Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:17.2070999Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:17.2071422Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:17.2071714Z ) 2025-05-07T20:32:17.2072064Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:17.2072507Z def test_silu_mul_quant( 2025-05-07T20:32:17.2072746Z self, 2025-05-07T20:32:17.2072939Z T: int, 2025-05-07T20:32:17.2073138Z D: int, 2025-05-07T20:32:17.2073357Z scale_ub: Optional[float], 2025-05-07T20:32:17.2073629Z contiguous: bool, 2025-05-07T20:32:17.2073869Z compiled: bool, 2025-05-07T20:32:17.2074100Z ) -> None: 2025-05-07T20:32:17.2074313Z torch.manual_seed(2025) 2025-05-07T20:32:17.2074557Z 2025-05-07T20:32:17.2074971Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:17.2075312Z 2025-05-07T20:32:17.2075514Z x_sign = torch.sign(x) 2025-05-07T20:32:17.2075806Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:17.2076117Z x = x_sign * x_clamp 2025-05-07T20:32:17.2076362Z x0 = x[:, :D] 2025-05-07T20:32:17.2076583Z x1 = x[:, D:] 2025-05-07T20:32:17.2076789Z 2025-05-07T20:32:17.2076978Z if contiguous: 2025-05-07T20:32:17.2077221Z x0 = x0.contiguous() 2025-05-07T20:32:17.2077477Z x1 = x1.contiguous() 2025-05-07T20:32:17.2077720Z 2025-05-07T20:32:17.2077916Z if scale_ub is not None: 2025-05-07T20:32:17.2078199Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:17.2078528Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:17.2078841Z ) 2025-05-07T20:32:17.2079037Z else: 2025-05-07T20:32:17.2079246Z scale_ub_tensor = None 2025-05-07T20:32:17.2079526Z 2025-05-07T20:32:17.2079766Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:17.2080126Z op = silu_mul_quant 2025-05-07T20:32:17.2080390Z if compiled: 2025-05-07T20:32:17.2080773Z op = torch.compile(op) 2025-05-07T20:32:17.2081163Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:17.2081745Z 2025-05-07T20:32:17.2090284Z > y_fp8, y_scale = fn() 2025-05-07T20:32:17.2090488Z 2025-05-07T20:32:17.2090599Z moe/activation_test.py:117: 2025-05-07T20:32:17.2090910Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:17.2091254Z moe/activation_test.py:115: in fn 2025-05-07T20:32:17.2091546Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:17.2092115Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/torch/_dynamo/eval_frame.py:678: in _fn 2025-05-07T20:32:17.2092682Z return fn(*args, **kwargs) 
2025-05-07T20:32:17.2093343Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:32:17.2094731Z _fbgemm_silu_mul_quant[grid]( 2025-05-07T20:32:17.2095277Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:17.2095965Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:17.2096627Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:17.2097171Z kernel = self.compile( 2025-05-07T20:32:17.2097723Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:17.2098755Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:17.2099156Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:17.2099402Z 2025-05-07T20:32:17.2099613Z self = 2025-05-07T20:32:17.2100686Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:17.2102156Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7fcda4df3060>} 2025-05-07T20:32:17.2103485Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:17.2104509Z context = 2025-05-07T20:32:17.2104804Z 2025-05-07T20:32:17.2105125Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:17.2105652Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:17.2106112Z module_map=module_map) 2025-05-07T20:32:17.2106487Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:17.2106853Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:32:17.2107116Z E ^ 2025-05-07T20:32:17.2107586Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:17.2108038Z 2025-05-07T20:32:17.2108452Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:17.2108965Z 2025-05-07T20:32:17.2109080Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:17.2109492Z self=, 2025-05-07T20:32:17.2109911Z T=2048, 2025-05-07T20:32:17.2110116Z D=5120, 2025-05-07T20:32:17.2110326Z scale_ub=None, 2025-05-07T20:32:17.2110546Z contiguous=False, 2025-05-07T20:32:17.2110775Z compiled=True, 2025-05-07T20:32:17.2110982Z ) 2025-05-07T20:32:17.2111308Z self = 2025-05-07T20:32:17.2111805Z T = 2048, D = 5120, scale_ub = None, contiguous = False, compiled = True 2025-05-07T20:32:17.2112075Z 2025-05-07T20:32:17.2112154Z @given( 2025-05-07T20:32:17.2112394Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:17.2112717Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:17.2113021Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:17.2113360Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:17.2113693Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:17.2113980Z ) 2025-05-07T20:32:17.2114334Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:17.2114786Z def test_silu_mul_quant( 2025-05-07T20:32:17.2115034Z self, 2025-05-07T20:32:17.2115227Z T: int, 2025-05-07T20:32:17.2115427Z D: int, 2025-05-07T20:32:17.2115649Z scale_ub: Optional[float], 2025-05-07T20:32:17.2115926Z contiguous: bool, 2025-05-07T20:32:17.2116169Z compiled: bool, 2025-05-07T20:32:17.2116396Z ) -> None: 2025-05-07T20:32:17.2116609Z torch.manual_seed(2025) 2025-05-07T20:32:17.2116857Z 2025-05-07T20:32:17.2117135Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:17.2117474Z 2025-05-07T20:32:17.2117671Z x_sign = torch.sign(x) 2025-05-07T20:32:17.2118043Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:17.2118353Z x = x_sign * x_clamp 2025-05-07T20:32:17.2118599Z x0 = x[:, :D] 2025-05-07T20:32:17.2118819Z x1 = x[:, D:] 2025-05-07T20:32:17.2119025Z 2025-05-07T20:32:17.2119215Z if contiguous: 2025-05-07T20:32:17.2119460Z x0 = x0.contiguous() 2025-05-07T20:32:17.2119727Z x1 = x1.contiguous() 2025-05-07T20:32:17.2119965Z 2025-05-07T20:32:17.2120164Z if scale_ub is not None: 2025-05-07T20:32:17.2120490Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:17.2120815Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:17.2121124Z ) 2025-05-07T20:32:17.2121325Z else: 2025-05-07T20:32:17.2121535Z scale_ub_tensor = None 2025-05-07T20:32:17.2121792Z 2025-05-07T20:32:17.2122029Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:17.2122341Z op = silu_mul_quant 2025-05-07T20:32:17.2122604Z if compiled: 2025-05-07T20:32:17.2122859Z op = torch.compile(op) 2025-05-07T20:32:17.2123154Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:17.2123438Z 2025-05-07T20:32:17.2123635Z > y_fp8, y_scale = fn() 2025-05-07T20:32:17.2123891Z 2025-05-07T20:32:17.2123994Z moe/activation_test.py:117: 2025-05-07T20:32:17.2124285Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:17.2124621Z moe/activation_test.py:115: in fn 2025-05-07T20:32:17.2124911Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:17.2125462Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/torch/_dynamo/eval_frame.py:678: in _fn 2025-05-07T20:32:17.2126026Z return fn(*args, **kwargs) 
2025-05-07T20:32:17.2126689Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:32:17.2127370Z _fbgemm_silu_mul_quant[grid]( 2025-05-07T20:32:17.2127911Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:17.2128598Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:17.2129266Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:17.2129797Z kernel = self.compile( 2025-05-07T20:32:17.2130350Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:17.2131010Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:17.2131420Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:17.2131649Z 2025-05-07T20:32:17.2131856Z self = 2025-05-07T20:32:17.2132927Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:17.2134421Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7fcda51507c0>} 2025-05-07T20:32:17.2135755Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:17.2136764Z context = 2025-05-07T20:32:17.2137058Z 2025-05-07T20:32:17.2137225Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:17.2137749Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:17.2138272Z module_map=module_map) 2025-05-07T20:32:17.2138633Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:17.2138991Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:32:17.2139258Z E ^ 2025-05-07T20:32:17.2139744Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:17.2140221Z 2025-05-07T20:32:17.2140630Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:17.5790322Z 2025-05-07T20:32:17.5790796Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:17.5791454Z self=, 2025-05-07T20:32:17.5792002Z T=2048, 2025-05-07T20:32:17.5792310Z D=5120, 2025-05-07T20:32:17.5792533Z scale_ub=1200.0, 2025-05-07T20:32:17.5792757Z contiguous=False, 2025-05-07T20:32:17.5792985Z compiled=True, 2025-05-07T20:32:17.5793218Z ) 2025-05-07T20:32:17.5793535Z self = 2025-05-07T20:32:17.5794036Z T = 2048, D = 5120, scale_ub = 1200.0, contiguous = False, compiled = True 2025-05-07T20:32:17.5794313Z 2025-05-07T20:32:17.5794754Z @given( 2025-05-07T20:32:17.5794995Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:17.5795307Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:17.5795615Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:17.5795953Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:17.5796273Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:17.5796563Z ) 2025-05-07T20:32:17.5796913Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:17.5797348Z def test_silu_mul_quant( 2025-05-07T20:32:17.5797592Z self, 2025-05-07T20:32:17.5797788Z T: int, 2025-05-07T20:32:17.5797982Z D: int, 2025-05-07T20:32:17.5798510Z scale_ub: Optional[float], 2025-05-07T20:32:17.5798788Z contiguous: bool, 2025-05-07T20:32:17.5799031Z compiled: bool, 2025-05-07T20:32:17.5799256Z ) -> None: 2025-05-07T20:32:17.5799473Z torch.manual_seed(2025) 2025-05-07T20:32:17.5799730Z 2025-05-07T20:32:17.5799998Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:17.5800345Z 2025-05-07T20:32:17.5800543Z x_sign = torch.sign(x) 2025-05-07T20:32:17.5800832Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:17.5801144Z x = x_sign * x_clamp 2025-05-07T20:32:17.5801387Z x0 = x[:, :D] 2025-05-07T20:32:17.5801597Z x1 = x[:, D:] 2025-05-07T20:32:17.5801808Z 2025-05-07T20:32:17.5801997Z if contiguous: 2025-05-07T20:32:17.5802228Z x0 = x0.contiguous() 2025-05-07T20:32:17.5802490Z x1 = x1.contiguous() 2025-05-07T20:32:17.5802732Z 2025-05-07T20:32:17.5802925Z if scale_ub is not None: 2025-05-07T20:32:17.5803201Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:17.5803541Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:17.5803851Z ) 2025-05-07T20:32:17.5804046Z else: 2025-05-07T20:32:17.5804265Z scale_ub_tensor = None 2025-05-07T20:32:17.5804522Z 2025-05-07T20:32:17.5804749Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:17.5805073Z op = silu_mul_quant 2025-05-07T20:32:17.5805331Z if compiled: 2025-05-07T20:32:17.5805576Z op = torch.compile(op) 2025-05-07T20:32:17.5805877Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:17.5806158Z 2025-05-07T20:32:17.5806348Z > y_fp8, y_scale = fn() 2025-05-07T20:32:17.5806520Z 2025-05-07T20:32:17.5806618Z moe/activation_test.py:117: 2025-05-07T20:32:17.5806929Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:17.5807365Z moe/activation_test.py:115: in fn 2025-05-07T20:32:17.5807642Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:17.5808203Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/torch/_dynamo/eval_frame.py:678: in _fn 2025-05-07T20:32:17.5808765Z return fn(*args, **kwargs) 
2025-05-07T20:32:17.5809416Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:32:17.5810234Z _fbgemm_silu_mul_quant[grid]( 2025-05-07T20:32:17.5810780Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:17.5811462Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:17.5812118Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:17.5812656Z kernel = self.compile( 2025-05-07T20:32:17.5813206Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:17.5813955Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:17.5814496Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:17.5814737Z 2025-05-07T20:32:17.5814944Z self = 2025-05-07T20:32:17.5816013Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:17.5817391Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7fcda5151580>} 2025-05-07T20:32:17.5818711Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:17.5819726Z context = 2025-05-07T20:32:17.5820022Z 2025-05-07T20:32:17.5820187Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:17.5820706Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:17.5821167Z module_map=module_map) 2025-05-07T20:32:17.5821533Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:17.5821888Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:32:17.5822144Z E ^ 2025-05-07T20:32:17.5822609Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:17.5823052Z 2025-05-07T20:32:17.5823465Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:17.5823969Z 2025-05-07T20:32:17.5824080Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:17.5824491Z self=, 2025-05-07T20:32:17.5824896Z T=4096, 2025-05-07T20:32:17.5825090Z D=5120, 2025-05-07T20:32:17.5825282Z scale_ub=1200.0, 2025-05-07T20:32:17.5825513Z contiguous=True, 2025-05-07T20:32:17.5825736Z compiled=True, 2025-05-07T20:32:17.5825938Z ) 2025-05-07T20:32:17.5826260Z self = 2025-05-07T20:32:17.5826753Z T = 4096, D = 5120, scale_ub = 1200.0, contiguous = True, compiled = True 2025-05-07T20:32:17.5827025Z 2025-05-07T20:32:17.5827108Z @given( 2025-05-07T20:32:17.5827334Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:17.5827648Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:17.5828017Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:17.5828344Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:17.5828676Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:17.5828968Z ) 2025-05-07T20:32:17.5829317Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:17.5829759Z def test_silu_mul_quant( 2025-05-07T20:32:17.5830002Z self, 2025-05-07T20:32:17.5830200Z T: int, 2025-05-07T20:32:17.5830448Z D: int, 2025-05-07T20:32:17.5830671Z scale_ub: Optional[float], 2025-05-07T20:32:17.5830945Z contiguous: bool, 2025-05-07T20:32:17.5831181Z compiled: bool, 2025-05-07T20:32:17.5831407Z ) -> None: 2025-05-07T20:32:17.5831624Z torch.manual_seed(2025) 2025-05-07T20:32:17.5831860Z 2025-05-07T20:32:17.5832135Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:17.5832478Z 2025-05-07T20:32:17.5832671Z x_sign = torch.sign(x) 2025-05-07T20:32:17.5832964Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:17.5833274Z x = x_sign * x_clamp 2025-05-07T20:32:17.5833510Z x0 = x[:, :D] 2025-05-07T20:32:17.5833732Z x1 = x[:, D:] 2025-05-07T20:32:17.5834060Z 2025-05-07T20:32:17.5834244Z if contiguous: 2025-05-07T20:32:17.5834482Z x0 = x0.contiguous() 2025-05-07T20:32:17.5834743Z x1 = x1.contiguous() 2025-05-07T20:32:17.5834976Z 2025-05-07T20:32:17.5835177Z if scale_ub is not None: 2025-05-07T20:32:17.5835450Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:17.5835785Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:17.5836090Z ) 2025-05-07T20:32:17.5836285Z else: 2025-05-07T20:32:17.5836496Z scale_ub_tensor = None 2025-05-07T20:32:17.5836743Z 2025-05-07T20:32:17.5836981Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:17.5837300Z op = silu_mul_quant 2025-05-07T20:32:17.5837545Z if compiled: 2025-05-07T20:32:17.5837794Z op = torch.compile(op) 2025-05-07T20:32:17.5838096Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:17.5838373Z 2025-05-07T20:32:17.5838574Z > y_fp8, y_scale = fn() 2025-05-07T20:32:17.5838738Z 2025-05-07T20:32:17.5838842Z moe/activation_test.py:117: 2025-05-07T20:32:17.5839138Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:17.5839474Z moe/activation_test.py:115: in fn 2025-05-07T20:32:17.5839759Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:17.5840360Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/torch/_dynamo/eval_frame.py:678: in _fn 2025-05-07T20:32:17.5840906Z return fn(*args, **kwargs) 
2025-05-07T20:32:17.5841556Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:32:17.5842240Z _fbgemm_silu_mul_quant[grid]( 2025-05-07T20:32:17.5842771Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:17.5843453Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:17.5844122Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:17.5844657Z kernel = self.compile( 2025-05-07T20:32:17.5845196Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:17.5845853Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:17.5846255Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:17.5846479Z 2025-05-07T20:32:17.5846697Z self = 2025-05-07T20:32:17.5847812Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:17.5849171Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7fcda5152840>} 2025-05-07T20:32:17.5850502Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:17.5851561Z context = 2025-05-07T20:32:17.5851849Z 2025-05-07T20:32:17.5852015Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:17.5852542Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:17.5853017Z module_map=module_map) 2025-05-07T20:32:17.5853387Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:17.5853848Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:32:17.5854202Z E ^ 2025-05-07T20:32:17.5854671Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:17.5855114Z 2025-05-07T20:32:17.5855532Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:17.7555019Z 2025-05-07T20:32:17.7555438Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:17.7556108Z self=, 2025-05-07T20:32:17.7556663Z T=128, 2025-05-07T20:32:17.7556920Z D=5120, 2025-05-07T20:32:17.7557168Z scale_ub=1200.0, 2025-05-07T20:32:17.7557419Z contiguous=False, 2025-05-07T20:32:17.7557644Z compiled=True, 2025-05-07T20:32:17.7557848Z ) 2025-05-07T20:32:17.7558168Z self = 2025-05-07T20:32:17.7558669Z T = 128, D = 5120, scale_ub = 1200.0, contiguous = False, compiled = True 2025-05-07T20:32:17.7558942Z 2025-05-07T20:32:17.7559026Z @given( 2025-05-07T20:32:17.7559253Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:17.7559571Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:17.7559897Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:17.7560256Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:17.7560582Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:17.7560869Z ) 2025-05-07T20:32:17.7561212Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:17.7561656Z def test_silu_mul_quant( 2025-05-07T20:32:17.7561904Z self, 2025-05-07T20:32:17.7562102Z T: int, 2025-05-07T20:32:17.7562295Z D: int, 2025-05-07T20:32:17.7562515Z scale_ub: Optional[float], 2025-05-07T20:32:17.7562787Z contiguous: bool, 2025-05-07T20:32:17.7563020Z compiled: bool, 2025-05-07T20:32:17.7563247Z ) -> None: 2025-05-07T20:32:17.7563474Z torch.manual_seed(2025) 2025-05-07T20:32:17.7563708Z 2025-05-07T20:32:17.7563984Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:17.7564336Z 2025-05-07T20:32:17.7564529Z x_sign = torch.sign(x) 2025-05-07T20:32:17.7564822Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:17.7565138Z x = x_sign * x_clamp 2025-05-07T20:32:17.7565374Z x0 = x[:, :D] 2025-05-07T20:32:17.7565593Z x1 = x[:, D:] 2025-05-07T20:32:17.7565805Z 2025-05-07T20:32:17.7565986Z if contiguous: 2025-05-07T20:32:17.7566220Z x0 = x0.contiguous() 2025-05-07T20:32:17.7566758Z x1 = x1.contiguous() 2025-05-07T20:32:17.7566989Z 2025-05-07T20:32:17.7567187Z if scale_ub is not None: 2025-05-07T20:32:17.7567463Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:17.7567800Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:17.7568110Z ) 2025-05-07T20:32:17.7568307Z else: 2025-05-07T20:32:17.7568518Z scale_ub_tensor = None 2025-05-07T20:32:17.7568765Z 2025-05-07T20:32:17.7568996Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:17.7569414Z op = silu_mul_quant 2025-05-07T20:32:17.7569658Z if compiled: 2025-05-07T20:32:17.7569905Z op = torch.compile(op) 2025-05-07T20:32:17.7570204Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:17.7570472Z 2025-05-07T20:32:17.7570667Z > y_fp8, y_scale = fn() 2025-05-07T20:32:17.7570832Z 2025-05-07T20:32:17.7570936Z moe/activation_test.py:117: 2025-05-07T20:32:17.7571231Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:17.7571561Z moe/activation_test.py:115: in fn 2025-05-07T20:32:17.7571842Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:17.7572545Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/torch/_dynamo/eval_frame.py:678: in _fn 2025-05-07T20:32:17.7573098Z return fn(*args, **kwargs) 
2025-05-07T20:32:17.7573922Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:32:17.7574605Z _fbgemm_silu_mul_quant[grid]( 2025-05-07T20:32:17.7575133Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:17.7575807Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:17.7576467Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:17.7577001Z kernel = self.compile( 2025-05-07T20:32:17.7577532Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:17.7578193Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:17.7578589Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:17.7578818Z 2025-05-07T20:32:17.7579031Z self = 2025-05-07T20:32:17.7580138Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:17.7581511Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7fcda51534c0>} 2025-05-07T20:32:17.7582839Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:17.7583857Z context = 2025-05-07T20:32:17.7584141Z 2025-05-07T20:32:17.7584312Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:17.7584827Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:17.7585293Z module_map=module_map) 2025-05-07T20:32:17.7585654Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:17.7586018Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:32:17.7586287Z E ^ 2025-05-07T20:32:17.7586744Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:17.7587242Z 2025-05-07T20:32:17.7587660Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:17.7588166Z 2025-05-07T20:32:17.7588269Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:17.7588686Z self=, 2025-05-07T20:32:17.7589092Z T=16384, 2025-05-07T20:32:17.7589280Z D=7168, 2025-05-07T20:32:17.7589652Z scale_ub=1200.0, 2025-05-07T20:32:17.7590200Z contiguous=True, 2025-05-07T20:32:17.7598809Z compiled=True, 2025-05-07T20:32:17.7599029Z ) 2025-05-07T20:32:17.7599362Z self = 2025-05-07T20:32:17.7599856Z T = 16384, D = 7168, scale_ub = 1200.0, contiguous = True, compiled = True 2025-05-07T20:32:17.7600138Z 2025-05-07T20:32:17.7600219Z @given( 2025-05-07T20:32:17.7600459Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:17.7600787Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:17.7601102Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:17.7601441Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:17.7601769Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:17.7602249Z ) 2025-05-07T20:32:17.7602612Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:17.7603069Z def test_silu_mul_quant( 2025-05-07T20:32:17.7603310Z self, 2025-05-07T20:32:17.7603518Z T: int, 2025-05-07T20:32:17.7603726Z D: int, 2025-05-07T20:32:17.7603944Z scale_ub: Optional[float], 2025-05-07T20:32:17.7604224Z contiguous: bool, 2025-05-07T20:32:17.7604469Z compiled: bool, 2025-05-07T20:32:17.7604694Z ) -> None: 2025-05-07T20:32:17.7604916Z torch.manual_seed(2025) 2025-05-07T20:32:17.7605164Z 2025-05-07T20:32:17.7605456Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:17.7605810Z 2025-05-07T20:32:17.7606012Z x_sign = torch.sign(x) 2025-05-07T20:32:17.7606309Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:17.7606618Z x = x_sign * x_clamp 2025-05-07T20:32:17.7606869Z x0 = x[:, :D] 2025-05-07T20:32:17.7607091Z x1 = x[:, D:] 2025-05-07T20:32:17.7607297Z 2025-05-07T20:32:17.7607488Z if contiguous: 2025-05-07T20:32:17.7607726Z x0 = x0.contiguous() 2025-05-07T20:32:17.7607986Z x1 = x1.contiguous() 2025-05-07T20:32:17.7608236Z 2025-05-07T20:32:17.7608436Z if scale_ub is not None: 2025-05-07T20:32:17.7608710Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:17.7609047Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:17.7609360Z ) 2025-05-07T20:32:17.7609551Z else: 2025-05-07T20:32:17.7609769Z scale_ub_tensor = None 2025-05-07T20:32:17.7610057Z 2025-05-07T20:32:17.7610305Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:17.7610628Z op = silu_mul_quant 2025-05-07T20:32:17.7610882Z if compiled: 2025-05-07T20:32:17.7611133Z op = torch.compile(op) 2025-05-07T20:32:17.7611435Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:17.7611716Z 2025-05-07T20:32:17.7611913Z > y_fp8, y_scale = fn() 2025-05-07T20:32:17.7612077Z 2025-05-07T20:32:17.7612176Z moe/activation_test.py:117: 2025-05-07T20:32:17.7612477Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:17.7612815Z moe/activation_test.py:115: in fn 2025-05-07T20:32:17.7613097Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:17.7613746Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/torch/_dynamo/eval_frame.py:678: in _fn 2025-05-07T20:32:17.7614306Z return fn(*args, **kwargs) 
2025-05-07T20:32:17.7588166Z 
2025-05-07T20:32:17.7588269Z Trying example: test_silu_mul_quant(
2025-05-07T20:32:17.7588686Z     self=,
2025-05-07T20:32:17.7589092Z     T=16384,
2025-05-07T20:32:17.7589280Z     D=7168,
2025-05-07T20:32:17.7589652Z     scale_ub=1200.0,
2025-05-07T20:32:17.7590200Z     contiguous=True,
2025-05-07T20:32:17.7598809Z     compiled=True,
2025-05-07T20:32:17.7599029Z )
2025-05-07T20:32:17.7599362Z self = 
2025-05-07T20:32:17.7599856Z T = 16384, D = 7168, scale_ub = 1200.0, contiguous = True, compiled = True
2025-05-07T20:32:17.7600138Z 
2025-05-07T20:32:17.7600219Z @given(
2025-05-07T20:32:17.7600459Z     T=st.sampled_from([1, 128, 2048, 4096, 16384]),
2025-05-07T20:32:17.7600787Z     D=st.sampled_from([5120, 7168]),
2025-05-07T20:32:17.7601102Z     scale_ub=st.sampled_from([None, 1200.00]),
2025-05-07T20:32:17.7601441Z     contiguous=st.sampled_from([True, False]),
2025-05-07T20:32:17.7601769Z     compiled=st.sampled_from([True, False]),
2025-05-07T20:32:17.7602249Z )
2025-05-07T20:32:17.7602612Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None)
2025-05-07T20:32:17.7603069Z def test_silu_mul_quant(
2025-05-07T20:32:17.7603310Z     self,
2025-05-07T20:32:17.7603518Z     T: int,
2025-05-07T20:32:17.7603726Z     D: int,
2025-05-07T20:32:17.7603944Z     scale_ub: Optional[float],
2025-05-07T20:32:17.7604224Z     contiguous: bool,
2025-05-07T20:32:17.7604469Z     compiled: bool,
2025-05-07T20:32:17.7604694Z ) -> None:
2025-05-07T20:32:17.7604916Z     torch.manual_seed(2025)
2025-05-07T20:32:17.7605164Z 
2025-05-07T20:32:17.7605456Z     x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16)
2025-05-07T20:32:17.7605810Z 
2025-05-07T20:32:17.7606012Z     x_sign = torch.sign(x)
2025-05-07T20:32:17.7606309Z     x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0)
2025-05-07T20:32:17.7606618Z     x = x_sign * x_clamp
2025-05-07T20:32:17.7606869Z     x0 = x[:, :D]
2025-05-07T20:32:17.7607091Z     x1 = x[:, D:]
2025-05-07T20:32:17.7607297Z 
2025-05-07T20:32:17.7607488Z     if contiguous:
2025-05-07T20:32:17.7607726Z         x0 = x0.contiguous()
2025-05-07T20:32:17.7607986Z         x1 = x1.contiguous()
2025-05-07T20:32:17.7608236Z 
2025-05-07T20:32:17.7608436Z     if scale_ub is not None:
2025-05-07T20:32:17.7608710Z         scale_ub_tensor = torch.tensor(
2025-05-07T20:32:17.7609047Z             [scale_ub], device="cuda", dtype=torch.float32
2025-05-07T20:32:17.7609360Z         )
2025-05-07T20:32:17.7609551Z     else:
2025-05-07T20:32:17.7609769Z         scale_ub_tensor = None
2025-05-07T20:32:17.7610057Z 
2025-05-07T20:32:17.7610305Z     def fn() -> Tuple[torch.Tensor, torch.Tensor]:
2025-05-07T20:32:17.7610628Z         op = silu_mul_quant
2025-05-07T20:32:17.7610882Z         if compiled:
2025-05-07T20:32:17.7611133Z             op = torch.compile(op)
2025-05-07T20:32:17.7611435Z         return op(x0, x1, scale_ub_tensor)
2025-05-07T20:32:17.7611716Z 
2025-05-07T20:32:17.7611913Z >   y_fp8, y_scale = fn()
2025-05-07T20:32:17.7612077Z 
2025-05-07T20:32:17.7612176Z moe/activation_test.py:117: 
2025-05-07T20:32:17.7612477Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _
2025-05-07T20:32:17.7612815Z moe/activation_test.py:115: in fn
2025-05-07T20:32:17.7613097Z     return op(x0, x1, scale_ub_tensor)
2025-05-07T20:32:17.7613746Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/torch/_dynamo/eval_frame.py:678: in _fn
2025-05-07T20:32:17.7614306Z     return fn(*args, **kwargs)
2025-05-07T20:32:17.7614965Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant
2025-05-07T20:32:17.7615717Z     _fbgemm_silu_mul_quant[grid](
2025-05-07T20:32:17.7616257Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/jit.py:330: in <lambda>
2025-05-07T20:32:17.7616931Z     return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs)
2025-05-07T20:32:17.7617583Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/jit.py:623: in run
2025-05-07T20:32:17.7618187Z     kernel = self.compile(
2025-05-07T20:32:17.7618730Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:273: in compile
2025-05-07T20:32:17.7619382Z     module = src.make_ir(options, codegen_fns, module_map, context)
2025-05-07T20:32:17.7619779Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _
2025-05-07T20:32:17.7620014Z 
2025-05-07T20:32:17.7620226Z self = 
2025-05-07T20:32:17.7621377Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True)
2025-05-07T20:32:17.7622742Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7fcda4378c20>}
2025-05-07T20:32:17.7624076Z module_map = {'triton.language.extra.libdevice': }
2025-05-07T20:32:17.7625081Z context = 
2025-05-07T20:32:17.7625373Z 
2025-05-07T20:32:17.7625540Z     def make_ir(self, options, codegen_fns, module_map, context):
2025-05-07T20:32:17.7626061Z >       return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns,
2025-05-07T20:32:17.7626516Z                           module_map=module_map)
2025-05-07T20:32:17.7626879Z E   triton.compiler.errors.CompilationError: at 1:0:
2025-05-07T20:32:17.7627235Z E   def _fbgemm_silu_mul_quant(
2025-05-07T20:32:17.7627494Z E   ^
2025-05-07T20:32:17.7627943Z E   ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')")
2025-05-07T20:32:17.7628391Z 
2025-05-07T20:32:17.7628801Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:100: CompilationError
2025-05-07T20:32:17.8781504Z 
2025-05-07T20:32:17.8781891Z Trying example: test_silu_mul_quant(
2025-05-07T20:32:17.8782513Z     self=,
2025-05-07T20:32:17.8783108Z     T=16384,
2025-05-07T20:32:17.8783387Z     D=5120,
2025-05-07T20:32:17.8783675Z     scale_ub=1200.0,
2025-05-07T20:32:17.8783970Z     contiguous=True,
2025-05-07T20:32:17.8784270Z     compiled=False,
2025-05-07T20:32:17.8784533Z )
2025-05-07T20:32:17.8812002Z E   triton.compiler.errors.CompilationError: at 1:0:
2025-05-07T20:32:17.8812360Z E   def _fbgemm_silu_mul_quant(
2025-05-07T20:32:17.8812621Z E   ^
2025-05-07T20:32:17.8813087Z E   ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')")
2025-05-07T20:32:17.8813529Z 
2025-05-07T20:32:17.8814080Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:100: CompilationError
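Hypothesis echoes each generated example before executing it, so any failing parameter set can be replayed without the property harness. A standalone repro sketch for the first example above, assuming silu_mul_quant is importable from the module named in the traceback; it mirrors the test body rather than any FBGEMM helper:

import torch

from fbgemm_gpu.experimental.gen_ai.moe.activation import silu_mul_quant

T, D = 16384, 7168  # parameters from the failing example above
torch.manual_seed(2025)
x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16)
x = torch.sign(x) * torch.clamp(torch.abs(x), 0.01, 2.0)
x0 = x[:, :D].contiguous()
x1 = x[:, D:].contiguous()
scale_ub = torch.tensor([1200.0], device="cuda", dtype=torch.float32)

# On SM 8.6 this raises the same CompilationError once the kernel compiles.
y_fp8, y_scale = silu_mul_quant(x0, x1, scale_ub)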
2025-05-07T20:32:17.8814592Z 
2025-05-07T20:32:17.8814709Z Trying example: test_silu_mul_quant(
2025-05-07T20:32:17.8815118Z     self=,
2025-05-07T20:32:17.8815520Z     T=1,
2025-05-07T20:32:17.8815836Z     D=7168,
2025-05-07T20:32:17.8816033Z     scale_ub=1200.0,
2025-05-07T20:32:17.8816262Z     contiguous=False,
2025-05-07T20:32:17.8816498Z     compiled=False,
2025-05-07T20:32:17.8816704Z )
2025-05-07T20:32:17.8842980Z E   triton.compiler.errors.CompilationError: at 1:0:
2025-05-07T20:32:17.8843331Z E   def _fbgemm_silu_mul_quant(
2025-05-07T20:32:17.8843598Z E   ^
2025-05-07T20:32:17.8844062Z E   ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')")
2025-05-07T20:32:17.8844506Z 
2025-05-07T20:32:17.8844923Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:100: CompilationError
2025-05-07T20:32:17.8845429Z 
2025-05-07T20:32:17.8845534Z Trying example: test_silu_mul_quant(
2025-05-07T20:32:17.8845948Z     self=,
2025-05-07T20:32:17.8846352Z     T=4096,
2025-05-07T20:32:17.8846538Z     D=7168,
2025-05-07T20:32:17.8846735Z     scale_ub=1200.0,
2025-05-07T20:32:17.8846963Z     contiguous=False,
2025-05-07T20:32:17.8847185Z     compiled=True,
2025-05-07T20:32:18.0470790Z )
2025-05-07T20:32:18.0504121Z E   triton.compiler.errors.CompilationError: at 1:0:
2025-05-07T20:32:18.0504481Z E   def _fbgemm_silu_mul_quant(
2025-05-07T20:32:18.0504737Z E   ^
2025-05-07T20:32:18.0505195Z E   ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')")
2025-05-07T20:32:18.0505639Z 
2025-05-07T20:32:18.0506055Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:100: CompilationError
2025-05-07T20:32:18.0506561Z 
2025-05-07T20:32:18.0506676Z Trying example: test_silu_mul_quant(
2025-05-07T20:32:18.0507080Z     self=,
2025-05-07T20:32:18.0507476Z     T=128,
2025-05-07T20:32:18.0507665Z     D=7168,
2025-05-07T20:32:18.0507883Z     scale_ub=1200.0,
2025-05-07T20:32:18.0508116Z     contiguous=False,
2025-05-07T20:32:18.0508343Z     compiled=True,
2025-05-07T20:32:18.0508545Z )
2025-05-07T20:32:18.0544083Z E   triton.compiler.errors.CompilationError: at 1:0:
2025-05-07T20:32:18.0544496Z E   def _fbgemm_silu_mul_quant(
2025-05-07T20:32:18.0544758Z E   ^
2025-05-07T20:32:18.0545217Z E   ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')")
2025-05-07T20:32:18.0545673Z 
2025-05-07T20:32:18.0546084Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:100: CompilationError
2025-05-07T20:32:18.0546587Z 
2025-05-07T20:32:18.0546692Z Trying example: test_silu_mul_quant(
2025-05-07T20:32:18.0547154Z     self=,
2025-05-07T20:32:18.0547557Z     T=2048,
2025-05-07T20:32:18.0547748Z     D=7168,
2025-05-07T20:32:18.0547947Z     scale_ub=None,
2025-05-07T20:32:18.0548172Z     contiguous=True,
2025-05-07T20:32:18.0548393Z     compiled=True,
2025-05-07T20:32:18.1753513Z )
2025-05-07T20:32:18.1781989Z E   triton.compiler.errors.CompilationError: at 1:0:
2025-05-07T20:32:18.1782345Z E   def _fbgemm_silu_mul_quant(
2025-05-07T20:32:18.1782608Z E   ^
2025-05-07T20:32:18.1783065Z E   ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')")
2025-05-07T20:32:18.1783509Z 
2025-05-07T20:32:18.1783925Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:100: CompilationError
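For orientation while reading these failures: judging from the test body, the op under test fuses a SiLU-gated product with FP8 quantization. A plain-PyTorch sketch of the presumed semantics follows; the rowwise scaling and the float8_e4m3fn target are assumptions inferred from the test's outputs, not FBGEMM's documented contract:

import torch

def silu_mul_quant_ref(x0, x1, scale_ub=None):
    # silu(x0) * x1, computed in float32 for accuracy.
    y = torch.nn.functional.silu(x0.float()) * x1.float()
    # Rowwise absolute maximum, optionally clamped to the provided upper bound.
    amax = y.abs().amax(dim=-1, keepdim=True).clamp(min=1e-12)
    if scale_ub is not None:
        amax = torch.minimum(amax, scale_ub)
    # float8_e4m3fn ("fp8e4nv") has a maximum representable value of 448.0.
    y_scale = amax / torch.finfo(torch.float8_e4m3fn).max
    y_fp8 = (y / y_scale).to(torch.float8_e4m3fn)
    return y_fp8, y_scale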
2025-05-07T20:32:18.1784432Z 
2025-05-07T20:32:18.1784539Z Trying example: test_silu_mul_quant(
2025-05-07T20:32:18.1784951Z     self=,
2025-05-07T20:32:18.1785352Z     T=16384,
2025-05-07T20:32:18.1785542Z     D=5120,
2025-05-07T20:32:18.1785747Z     scale_ub=None,
2025-05-07T20:32:18.1785970Z     contiguous=False,
2025-05-07T20:32:18.1786196Z     compiled=False,
2025-05-07T20:32:18.1786405Z )
2025-05-07T20:32:18.1793485Z >   x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0)
2025-05-07T20:32:18.1795553Z E   torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 320.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 140.44 MiB is free. Including non-PyTorch memory, this process has 21.92 GiB memory in use. Of the allocated memory 21.60 GiB is allocated by PyTorch, and 45.02 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables)
2025-05-07T20:32:18.1797384Z 
2025-05-07T20:32:18.1797512Z moe/activation_test.py:95: OutOfMemoryError
2025-05-07T20:32:18.1797722Z 
2025-05-07T20:32:18.1797833Z Trying example: test_silu_mul_quant(
2025-05-07T20:32:18.1798500Z     self=,
2025-05-07T20:32:18.1798908Z     T=4096,
2025-05-07T20:32:18.1799101Z     D=7168,
2025-05-07T20:32:18.1799293Z     scale_ub=1200.0,
2025-05-07T20:32:18.1799518Z     contiguous=True,
2025-05-07T20:32:18.1799743Z     compiled=True,
2025-05-07T20:32:18.1799943Z )
2025-05-07T20:32:18.1806902Z >   x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0)
2025-05-07T20:32:18.1808867Z E   torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 112.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 28.44 MiB is free. Including non-PyTorch memory, this process has 22.03 GiB memory in use. Of the allocated memory 21.61 GiB is allocated by PyTorch, and 141.02 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables)
2025-05-07T20:32:18.1810800Z 
2025-05-07T20:32:18.1810920Z moe/activation_test.py:95: OutOfMemoryError
2025-05-07T20:32:18.1811135Z 
2025-05-07T20:32:18.1811238Z Trying example: test_silu_mul_quant(
2025-05-07T20:32:18.1811711Z     self=,
2025-05-07T20:32:18.1812107Z     T=16384,
2025-05-07T20:32:18.1812306Z     D=7168,
2025-05-07T20:32:18.1812505Z     scale_ub=None,
2025-05-07T20:32:18.1812717Z     contiguous=False,
2025-05-07T20:32:18.1812952Z     compiled=False,
2025-05-07T20:32:18.1813159Z )
2025-05-07T20:32:18.1819439Z >   x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16)
2025-05-07T20:32:18.1821450Z E   torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 448.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 140.44 MiB is free. Including non-PyTorch memory, this process has 21.92 GiB memory in use. Of the allocated memory 21.50 GiB is allocated by PyTorch, and 141.02 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables)
2025-05-07T20:32:18.1823274Z 
2025-05-07T20:32:18.1823391Z moe/activation_test.py:92: OutOfMemoryError
2025-05-07T20:32:18.3068444Z 
2025-05-07T20:32:18.3069057Z Trying example: test_silu_mul_quant(
2025-05-07T20:32:18.3069696Z     self=,
2025-05-07T20:32:18.3070317Z     T=2048,
2025-05-07T20:32:18.3070582Z     D=7168,
2025-05-07T20:32:18.3070821Z     scale_ub=1200.0,
2025-05-07T20:32:18.3071080Z     contiguous=True,
2025-05-07T20:32:18.3071316Z     compiled=True,
2025-05-07T20:32:18.3071532Z )
2025-05-07T20:32:18.3078933Z >   x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0)
2025-05-07T20:32:18.3081049Z E   torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 56.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 28.44 MiB is free. Including non-PyTorch memory, this process has 22.03 GiB memory in use. Of the allocated memory 21.67 GiB is allocated by PyTorch, and 85.02 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables)
2025-05-07T20:32:18.3082874Z 
2025-05-07T20:32:18.3082998Z moe/activation_test.py:95: OutOfMemoryError
2025-05-07T20:32:18.3083209Z 
2025-05-07T20:32:18.3083313Z Trying example: test_silu_mul_quant(
2025-05-07T20:32:18.3083724Z     self=,
2025-05-07T20:32:18.3084127Z     T=2048,
2025-05-07T20:32:18.3084325Z     D=7168,
2025-05-07T20:32:18.3084518Z     scale_ub=None,
2025-05-07T20:32:18.3084736Z     contiguous=True,
2025-05-07T20:32:18.3084966Z     compiled=False,
2025-05-07T20:32:18.3085169Z )
2025-05-07T20:32:18.3091828Z >   x_sign = torch.sign(x)
2025-05-07T20:32:18.3093838Z E   torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 56.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 28.44 MiB is free. Including non-PyTorch memory, this process has 22.03 GiB memory in use. Of the allocated memory 21.67 GiB is allocated by PyTorch, and 85.02 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables)
2025-05-07T20:32:18.3095702Z 
2025-05-07T20:32:18.3095826Z moe/activation_test.py:94: OutOfMemoryError
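From this point the failures are cascading allocator exhaustion rather than anything specific to the kernel: roughly 22 GiB of bfloat16 inputs from earlier examples are still held by the process, so even a 56 MiB torch.randn no longer fits. Two mitigation sketches consistent with the allocator's own hint in the message; neither is necessarily what this CI job should adopt:

import gc
import os

# (1) Opt in to expandable segments, exactly as the error message suggests.
#     Must be set before the first CUDA allocation in the process, so in CI it
#     belongs in the job environment rather than in test code.
os.environ["PYTORCH_CUDA_ALLOC_CONF"] = "expandable_segments:True"

import torch

def release_cuda_memory() -> None:
    # (2) Between Hypothesis examples, collect dead tensors and return cached
    #     blocks to the driver, e.g. from the test's tearDown.
    gc.collect()
    torch.cuda.empty_cache()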
See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables) 2025-05-07T20:32:18.3095702Z 2025-05-07T20:32:18.3095826Z moe/activation_test.py:94: OutOfMemoryError 2025-05-07T20:32:18.3096038Z 2025-05-07T20:32:18.3096148Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:18.3096553Z self=, 2025-05-07T20:32:18.3097000Z T=1, 2025-05-07T20:32:18.3097189Z D=7168, 2025-05-07T20:32:18.3097385Z scale_ub=1200.0, 2025-05-07T20:32:18.3097613Z contiguous=True, 2025-05-07T20:32:18.3097837Z compiled=False, 2025-05-07T20:32:18.3098038Z ) 2025-05-07T20:32:18.3098639Z self = 2025-05-07T20:32:18.3099124Z T = 1, D = 7168, scale_ub = 1200.0, contiguous = True, compiled = False 2025-05-07T20:32:18.3099389Z 2025-05-07T20:32:18.3099474Z @given( 2025-05-07T20:32:18.3099706Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:18.3100021Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:18.3100327Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:18.3100781Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:18.3101115Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:18.3101405Z ) 2025-05-07T20:32:18.3101751Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:18.3102198Z def test_silu_mul_quant( 2025-05-07T20:32:18.3102442Z self, 2025-05-07T20:32:18.3102635Z T: int, 2025-05-07T20:32:18.3102838Z D: int, 2025-05-07T20:32:18.3103060Z scale_ub: Optional[float], 2025-05-07T20:32:18.3103333Z contiguous: bool, 2025-05-07T20:32:18.3103570Z compiled: bool, 2025-05-07T20:32:18.3103798Z ) -> None: 2025-05-07T20:32:18.3104016Z torch.manual_seed(2025) 2025-05-07T20:32:18.3104259Z 2025-05-07T20:32:18.3104532Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:18.3104880Z 2025-05-07T20:32:18.3105072Z x_sign = torch.sign(x) 2025-05-07T20:32:18.3105371Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:18.3105682Z x = x_sign * x_clamp 2025-05-07T20:32:18.3105924Z x0 = x[:, :D] 2025-05-07T20:32:18.3106148Z x1 = x[:, D:] 2025-05-07T20:32:18.3106366Z 2025-05-07T20:32:18.3106554Z if contiguous: 2025-05-07T20:32:18.3106792Z x0 = x0.contiguous() 2025-05-07T20:32:18.3107060Z x1 = x1.contiguous() 2025-05-07T20:32:18.3107299Z 2025-05-07T20:32:18.3107496Z if scale_ub is not None: 2025-05-07T20:32:18.3107776Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:18.3108110Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:18.3108424Z ) 2025-05-07T20:32:18.3108628Z else: 2025-05-07T20:32:18.3108845Z scale_ub_tensor = None 2025-05-07T20:32:18.3109095Z 2025-05-07T20:32:18.3109333Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:18.3109657Z op = silu_mul_quant 2025-05-07T20:32:18.3109915Z if compiled: 2025-05-07T20:32:18.3110191Z op = torch.compile(op) 2025-05-07T20:32:18.3110518Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:18.3110793Z 2025-05-07T20:32:18.3110991Z > y_fp8, y_scale = fn() 2025-05-07T20:32:18.3111156Z 2025-05-07T20:32:18.3111261Z moe/activation_test.py:117: 2025-05-07T20:32:18.3111555Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:18.3111888Z moe/activation_test.py:115: in fn 2025-05-07T20:32:18.3112171Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:18.3112859Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:32:18.3113618Z 
_fbgemm_silu_mul_quant[grid]( 2025-05-07T20:32:18.3114157Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:18.3114841Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:18.3115497Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:18.3116034Z kernel = self.compile( 2025-05-07T20:32:18.3116640Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:18.3117297Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:18.3117694Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:18.3117928Z 2025-05-07T20:32:18.3118138Z self = 2025-05-07T20:32:18.3119216Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:18.3120680Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7fcda49e4b80>} 2025-05-07T20:32:18.3122004Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:18.3123022Z context = 2025-05-07T20:32:18.3123318Z 2025-05-07T20:32:18.3123484Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:18.3124002Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:18.3124466Z module_map=module_map) 2025-05-07T20:32:18.3124831Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:18.3125187Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:32:18.3125456Z E ^ 2025-05-07T20:32:18.3125913Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:18.3126358Z 2025-05-07T20:32:18.3126769Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:18.3127283Z 2025-05-07T20:32:18.3127411Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:18.3127830Z self=, 2025-05-07T20:32:18.3128228Z T=128, 2025-05-07T20:32:18.3128426Z D=5120, 2025-05-07T20:32:18.3128630Z scale_ub=None, 2025-05-07T20:32:18.3128848Z contiguous=True, 2025-05-07T20:32:18.3129079Z compiled=False, 2025-05-07T20:32:18.3138029Z ) 2025-05-07T20:32:18.3138389Z self = 2025-05-07T20:32:18.3138882Z T = 128, D = 5120, scale_ub = None, contiguous = True, compiled = False 2025-05-07T20:32:18.3139155Z 2025-05-07T20:32:18.3139243Z @given( 2025-05-07T20:32:18.3139473Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:18.3139789Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:18.3140103Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:18.3140478Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:18.3140806Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:18.3141093Z ) 2025-05-07T20:32:18.3141442Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:18.3141876Z def test_silu_mul_quant( 2025-05-07T20:32:18.3142118Z self, 2025-05-07T20:32:18.3142409Z T: int, 2025-05-07T20:32:18.3142605Z D: int, 2025-05-07T20:32:18.3142828Z scale_ub: Optional[float], 2025-05-07T20:32:18.3143106Z contiguous: bool, 2025-05-07T20:32:18.3143339Z compiled: bool, 2025-05-07T20:32:18.3143565Z ) -> None: 2025-05-07T20:32:18.3143788Z torch.manual_seed(2025) 2025-05-07T20:32:18.3144023Z 2025-05-07T20:32:18.3144298Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:18.3144641Z 2025-05-07T20:32:18.3144893Z x_sign = torch.sign(x) 2025-05-07T20:32:18.3145187Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:18.3145502Z x = x_sign * x_clamp 2025-05-07T20:32:18.3145745Z x0 = x[:, :D] 2025-05-07T20:32:18.3145953Z x1 = x[:, D:] 2025-05-07T20:32:18.3146163Z 2025-05-07T20:32:18.3146350Z if contiguous: 2025-05-07T20:32:18.3146577Z x0 = x0.contiguous() 2025-05-07T20:32:18.3146839Z x1 = x1.contiguous() 2025-05-07T20:32:18.3147087Z 2025-05-07T20:32:18.3147277Z if scale_ub is not None: 2025-05-07T20:32:18.3147555Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:18.3147882Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:18.3148280Z ) 2025-05-07T20:32:18.3148478Z else: 2025-05-07T20:32:18.3148683Z scale_ub_tensor = None 2025-05-07T20:32:18.3148937Z 2025-05-07T20:32:18.3149168Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:18.3149485Z op = silu_mul_quant 2025-05-07T20:32:18.3149731Z if compiled: 2025-05-07T20:32:18.3149970Z op = torch.compile(op) 2025-05-07T20:32:18.3150254Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:18.3150526Z 2025-05-07T20:32:18.3150720Z > y_fp8, y_scale = fn() 2025-05-07T20:32:18.3150884Z 2025-05-07T20:32:18.3150989Z moe/activation_test.py:117: 2025-05-07T20:32:18.3151280Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:18.3151611Z moe/activation_test.py:115: in fn 2025-05-07T20:32:18.3151889Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:18.3152571Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:32:18.3153252Z 
_fbgemm_silu_mul_quant[grid]( 2025-05-07T20:32:18.3153786Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:18.3154458Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:18.3155108Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:18.3155637Z kernel = self.compile( 2025-05-07T20:32:18.3156173Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:18.3156820Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:18.3157215Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:18.3157445Z 2025-05-07T20:32:18.3157655Z self = 2025-05-07T20:32:18.3158724Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:18.3160095Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7fcda49e5a80>} 2025-05-07T20:32:18.3161445Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:18.3162512Z context = 2025-05-07T20:32:18.3162794Z 2025-05-07T20:32:18.3162964Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:18.3163480Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:18.3163937Z module_map=module_map) 2025-05-07T20:32:18.3164303Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:18.3164697Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:32:18.3164949Z E ^ 2025-05-07T20:32:18.3165407Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:18.3165844Z 2025-05-07T20:32:18.3166258Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:18.4287923Z 2025-05-07T20:32:18.4289150Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:18.4290056Z self=, 2025-05-07T20:32:18.4290472Z T=128, 2025-05-07T20:32:18.4290674Z D=7168, 2025-05-07T20:32:18.4290870Z scale_ub=None, 2025-05-07T20:32:18.4291441Z contiguous=True, 2025-05-07T20:32:18.4291672Z compiled=False, 2025-05-07T20:32:18.4291886Z ) 2025-05-07T20:32:18.4292204Z self = 2025-05-07T20:32:18.4292708Z T = 128, D = 7168, scale_ub = None, contiguous = True, compiled = False 2025-05-07T20:32:18.4292982Z 2025-05-07T20:32:18.4293060Z @given( 2025-05-07T20:32:18.4293297Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:18.4293608Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:18.4294044Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:18.4294376Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:18.4294706Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:18.4294993Z ) 2025-05-07T20:32:18.4295342Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:18.4295780Z def test_silu_mul_quant( 2025-05-07T20:32:18.4296033Z self, 2025-05-07T20:32:18.4296233Z T: int, 2025-05-07T20:32:18.4296426Z D: int, 2025-05-07T20:32:18.4296645Z scale_ub: Optional[float], 2025-05-07T20:32:18.4296919Z contiguous: bool, 2025-05-07T20:32:18.4297161Z compiled: bool, 2025-05-07T20:32:18.4297382Z ) -> None: 2025-05-07T20:32:18.4297598Z torch.manual_seed(2025) 2025-05-07T20:32:18.4297836Z 2025-05-07T20:32:18.4298105Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:18.4298612Z 2025-05-07T20:32:18.4298806Z x_sign = torch.sign(x) 2025-05-07T20:32:18.4299091Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:18.4299405Z x = x_sign * x_clamp 2025-05-07T20:32:18.4299646Z x0 = x[:, :D] 2025-05-07T20:32:18.4299855Z x1 = x[:, D:] 2025-05-07T20:32:18.4300065Z 2025-05-07T20:32:18.4300250Z if contiguous: 2025-05-07T20:32:18.4300476Z x0 = x0.contiguous() 2025-05-07T20:32:18.4300738Z x1 = x1.contiguous() 2025-05-07T20:32:18.4300983Z 2025-05-07T20:32:18.4301169Z if scale_ub is not None: 2025-05-07T20:32:18.4301445Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:18.4301776Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:18.4302086Z ) 2025-05-07T20:32:18.4302274Z else: 2025-05-07T20:32:18.4302487Z scale_ub_tensor = None 2025-05-07T20:32:18.4302739Z 2025-05-07T20:32:18.4302965Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:18.4303283Z op = silu_mul_quant 2025-05-07T20:32:18.4303534Z if compiled: 2025-05-07T20:32:18.4303875Z op = torch.compile(op) 2025-05-07T20:32:18.4304173Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:18.4304482Z 2025-05-07T20:32:18.4304676Z > y_fp8, y_scale = fn() 2025-05-07T20:32:18.4304839Z 2025-05-07T20:32:18.4304944Z moe/activation_test.py:117: 2025-05-07T20:32:18.4305242Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:18.4305583Z moe/activation_test.py:115: in fn 2025-05-07T20:32:18.4305868Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:18.4306645Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:32:18.4307330Z 
_fbgemm_silu_mul_quant[grid]( 2025-05-07T20:32:18.4307864Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:18.4308543Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:18.4309197Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:18.4309730Z kernel = self.compile( 2025-05-07T20:32:18.4310937Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:18.4311599Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:18.4311987Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:18.4312223Z 2025-05-07T20:32:18.4312431Z self = 2025-05-07T20:32:18.4313500Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:18.4314869Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7fcda49e6980>} 2025-05-07T20:32:18.4316195Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:18.4317201Z context = 2025-05-07T20:32:18.4317489Z 2025-05-07T20:32:18.4317655Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:18.4318166Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:18.4318621Z module_map=module_map) 2025-05-07T20:32:18.4318987Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:18.4319339Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:32:18.4319593Z E ^ 2025-05-07T20:32:18.4320052Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:18.4320496Z 2025-05-07T20:32:18.4320908Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:18.4321412Z 2025-05-07T20:32:18.4321519Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:18.4321925Z self=, 2025-05-07T20:32:18.4322328Z T=2048, 2025-05-07T20:32:18.4322518Z D=7168, 2025-05-07T20:32:18.4322708Z scale_ub=1200.0, 2025-05-07T20:32:18.4322930Z contiguous=True, 2025-05-07T20:32:18.4323156Z compiled=False, 2025-05-07T20:32:18.4323359Z ) 2025-05-07T20:32:18.4323678Z self = 2025-05-07T20:32:18.4324165Z T = 2048, D = 7168, scale_ub = 1200.0, contiguous = True, compiled = False 2025-05-07T20:32:18.4324483Z 2025-05-07T20:32:18.4324567Z @given( 2025-05-07T20:32:18.4324791Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:18.4325102Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:18.4325408Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:18.4325738Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:18.4326063Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:18.4326350Z ) 2025-05-07T20:32:18.4326691Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:18.4327178Z def test_silu_mul_quant( 2025-05-07T20:32:18.4327422Z self, 2025-05-07T20:32:18.4327618Z T: int, 2025-05-07T20:32:18.4327811Z D: int, 2025-05-07T20:32:18.4328033Z scale_ub: Optional[float], 2025-05-07T20:32:18.4328301Z contiguous: bool, 2025-05-07T20:32:18.4328555Z compiled: bool, 2025-05-07T20:32:18.4328774Z ) -> None: 2025-05-07T20:32:18.4328998Z torch.manual_seed(2025) 2025-05-07T20:32:18.4329239Z 2025-05-07T20:32:18.4329514Z > x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:18.4331661Z E torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 56.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 26.44 MiB is free. Including non-PyTorch memory, this process has 22.04 GiB memory in use. Of the allocated memory 21.69 GiB is allocated by PyTorch, and 59.18 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. 
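The CompilationError above is an architecture limit, not a bug in the test: Triton only lowers the fp8e4nv (e4m3) dtype on GPUs with compute capability 8.9 or newer, and on older parts such as the A10G (sm_86) it exposes only fp8e4b15 and fp8e5, exactly the supported list in the ValueError. A minimal gating sketch (the helper name supports_fp8e4nv is hypothetical, not part of activation_test.py):

import torch

def supports_fp8e4nv() -> bool:
    # Triton's fp8e4nv (e4m3) lowering requires sm_89+ (Ada/Hopper);
    # sm_86 and older only get fp8e4b15 / fp8e5, as the error above reports.
    return torch.cuda.is_available() and torch.cuda.get_device_capability() >= (8, 9)

Skipping the test on such a check (e.g. with unittest.skipUnless) would avoid failing the whole Hypothesis run on unsupported hardware.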
See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables) 2025-05-07T20:32:18.4333483Z 2025-05-07T20:32:18.4333600Z moe/activation_test.py:92: OutOfMemoryError 2025-05-07T20:32:18.4333879Z 2025-05-07T20:32:18.4333985Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:18.4334393Z self=, 2025-05-07T20:32:18.4334786Z T=1, 2025-05-07T20:32:18.4334974Z D=5120, 2025-05-07T20:32:18.4335170Z scale_ub=1200.0, 2025-05-07T20:32:18.4335396Z contiguous=True, 2025-05-07T20:32:18.4335613Z compiled=False, 2025-05-07T20:32:18.4335821Z ) 2025-05-07T20:32:18.4336148Z self = 2025-05-07T20:32:18.4336621Z T = 1, D = 5120, scale_ub = 1200.0, contiguous = True, compiled = False 2025-05-07T20:32:18.4336892Z 2025-05-07T20:32:18.4336968Z @given( 2025-05-07T20:32:18.4337197Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:18.4337501Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:18.4337803Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:18.4338129Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:18.4338447Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:18.4338734Z ) 2025-05-07T20:32:18.4339081Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:18.4339523Z def test_silu_mul_quant( 2025-05-07T20:32:18.4339756Z self, 2025-05-07T20:32:18.4339951Z T: int, 2025-05-07T20:32:18.4340145Z D: int, 2025-05-07T20:32:18.4340360Z scale_ub: Optional[float], 2025-05-07T20:32:18.4340654Z contiguous: bool, 2025-05-07T20:32:18.4340920Z compiled: bool, 2025-05-07T20:32:18.4341138Z ) -> None: 2025-05-07T20:32:18.4341355Z torch.manual_seed(2025) 2025-05-07T20:32:18.4341598Z 2025-05-07T20:32:18.4341862Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:18.4342202Z 2025-05-07T20:32:18.4342399Z x_sign = torch.sign(x) 2025-05-07T20:32:18.4342681Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:18.4342992Z x = x_sign * x_clamp 2025-05-07T20:32:18.4343237Z x0 = x[:, :D] 2025-05-07T20:32:18.4343500Z x1 = x[:, D:] 2025-05-07T20:32:18.4343715Z 2025-05-07T20:32:18.4343900Z if contiguous: 2025-05-07T20:32:18.4344124Z x0 = x0.contiguous() 2025-05-07T20:32:18.4344380Z x1 = x1.contiguous() 2025-05-07T20:32:18.4344625Z 2025-05-07T20:32:18.4344823Z if scale_ub is not None: 2025-05-07T20:32:18.4345093Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:18.4345425Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:18.4345736Z ) 2025-05-07T20:32:18.4345968Z else: 2025-05-07T20:32:18.4346178Z scale_ub_tensor = None 2025-05-07T20:32:18.4346427Z 2025-05-07T20:32:18.4346651Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:18.4346962Z op = silu_mul_quant 2025-05-07T20:32:18.4347214Z if compiled: 2025-05-07T20:32:18.4347457Z op = torch.compile(op) 2025-05-07T20:32:18.4347751Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:18.4348030Z 2025-05-07T20:32:18.4348217Z > y_fp8, y_scale = fn() 2025-05-07T20:32:18.4348388Z 2025-05-07T20:32:18.4348485Z moe/activation_test.py:117: 2025-05-07T20:32:18.4348778Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:18.4349217Z moe/activation_test.py:115: in fn 2025-05-07T20:32:18.4349496Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:18.4350176Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:32:18.4350897Z 
_fbgemm_silu_mul_quant[grid]( 2025-05-07T20:32:18.4351438Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:18.4352110Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:18.4352762Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:18.4353294Z kernel = self.compile( 2025-05-07T20:32:18.4353824Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:18.4354477Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:18.4354875Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:18.4355101Z 2025-05-07T20:32:18.4355311Z self = 2025-05-07T20:32:18.4356376Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:18.4357720Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7fcda49e7e20>} 2025-05-07T20:32:18.4359045Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:18.4360055Z context = 2025-05-07T20:32:18.4360337Z 2025-05-07T20:32:18.4360501Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:18.4361021Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:18.4361484Z module_map=module_map) 2025-05-07T20:32:18.4361846Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:18.4362192Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:32:18.4362456Z E ^ 2025-05-07T20:32:18.4362911Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:18.4363399Z 2025-05-07T20:32:18.4363808Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:18.5202034Z 2025-05-07T20:32:18.5202207Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:18.5202661Z self=, 2025-05-07T20:32:18.5203087Z T=2048, 2025-05-07T20:32:18.5203376Z D=5120, 2025-05-07T20:32:18.5203589Z scale_ub=None, 2025-05-07T20:32:18.5203928Z contiguous=True, 2025-05-07T20:32:18.5204165Z compiled=False, 2025-05-07T20:32:18.5204379Z ) 2025-05-07T20:32:18.5204704Z self = 2025-05-07T20:32:18.5205208Z T = 2048, D = 5120, scale_ub = None, contiguous = True, compiled = False 2025-05-07T20:32:18.5205488Z 2025-05-07T20:32:18.5205572Z @given( 2025-05-07T20:32:18.5205819Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:18.5206142Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:18.5206459Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:18.5206802Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:18.5207133Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:18.5207552Z ) 2025-05-07T20:32:18.5207920Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:18.5208364Z def test_silu_mul_quant( 2025-05-07T20:32:18.5208619Z self, 2025-05-07T20:32:18.5208830Z T: int, 2025-05-07T20:32:18.5209037Z D: int, 2025-05-07T20:32:18.5209260Z scale_ub: Optional[float], 2025-05-07T20:32:18.5209542Z contiguous: bool, 2025-05-07T20:32:18.5209791Z compiled: bool, 2025-05-07T20:32:18.5210025Z ) -> None: 2025-05-07T20:32:18.5210253Z torch.manual_seed(2025) 2025-05-07T20:32:18.5210507Z 2025-05-07T20:32:18.5210785Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:18.5211139Z 2025-05-07T20:32:18.5211346Z > x_sign = torch.sign(x) 2025-05-07T20:32:18.5213261Z E torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 40.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 26.44 MiB is free. Including non-PyTorch memory, this process has 22.04 GiB memory in use. Of the allocated memory 21.73 GiB is allocated by PyTorch, and 19.12 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. 
See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables) 2025-05-07T20:32:18.5215214Z 2025-05-07T20:32:18.5215338Z moe/activation_test.py:94: OutOfMemoryError 2025-05-07T20:32:18.5215562Z 2025-05-07T20:32:18.5215668Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:18.5216088Z self=, 2025-05-07T20:32:18.5216504Z T=16384, 2025-05-07T20:32:18.5216703Z D=5120, 2025-05-07T20:32:18.5216907Z scale_ub=None, 2025-05-07T20:32:18.5217133Z contiguous=True, 2025-05-07T20:32:18.5217360Z compiled=False, 2025-05-07T20:32:18.5217578Z ) 2025-05-07T20:32:18.5217916Z self = 2025-05-07T20:32:18.5218409Z T = 16384, D = 5120, scale_ub = None, contiguous = True, compiled = False 2025-05-07T20:32:18.5218693Z 2025-05-07T20:32:18.5218775Z @given( 2025-05-07T20:32:18.5219025Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:18.5219341Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:18.5219664Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:18.5220008Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:18.5220349Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:18.5220638Z ) 2025-05-07T20:32:18.5220996Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:18.5221517Z def test_silu_mul_quant( 2025-05-07T20:32:18.5221762Z self, 2025-05-07T20:32:18.5221964Z T: int, 2025-05-07T20:32:18.5222169Z D: int, 2025-05-07T20:32:18.5222392Z scale_ub: Optional[float], 2025-05-07T20:32:18.5222676Z contiguous: bool, 2025-05-07T20:32:18.5222924Z compiled: bool, 2025-05-07T20:32:18.5223150Z ) -> None: 2025-05-07T20:32:18.5223377Z torch.manual_seed(2025) 2025-05-07T20:32:18.5223674Z 2025-05-07T20:32:18.5223946Z > x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:18.5225954Z E torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 320.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 26.44 MiB is free. Including non-PyTorch memory, this process has 22.04 GiB memory in use. Of the allocated memory 21.73 GiB is allocated by PyTorch, and 19.12 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. 
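For orientation, the repeated test body shows what the op consumes: a [T, 2*D] bf16 tensor split into halves x0 and x1, with silu_mul_quant expected to return an fp8 tensor plus rowwise scales (y_fp8, y_scale). A plain-PyTorch sketch of the unquantized reference, assuming the SwiGLU-style pairing the op name suggests (silu_mul_ref is a hypothetical name, not the library's API):

import torch
import torch.nn.functional as F

def silu_mul_ref(x0: torch.Tensor, x1: torch.Tensor) -> torch.Tensor:
    # bf16 reference for the fused kernel's pre-quantization output.
    return F.silu(x0) * x1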
See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables) 2025-05-07T20:32:18.5227774Z 2025-05-07T20:32:18.5227969Z moe/activation_test.py:92: OutOfMemoryError 2025-05-07T20:32:18.5228192Z 2025-05-07T20:32:18.5228298Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:18.5228717Z self=, 2025-05-07T20:32:18.5229121Z T=4096, 2025-05-07T20:32:18.5229317Z D=5120, 2025-05-07T20:32:18.5229518Z scale_ub=None, 2025-05-07T20:32:18.5229733Z contiguous=True, 2025-05-07T20:32:18.5229964Z compiled=False, 2025-05-07T20:32:18.5230177Z ) 2025-05-07T20:32:18.5230498Z self = 2025-05-07T20:32:18.5230997Z T = 4096, D = 5120, scale_ub = None, contiguous = True, compiled = False 2025-05-07T20:32:18.5231280Z 2025-05-07T20:32:18.5231363Z @given( 2025-05-07T20:32:18.5231604Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:18.5231922Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:18.5232242Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:18.5232580Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:18.5232911Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:18.5233206Z ) 2025-05-07T20:32:18.5233565Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:18.5234020Z def test_silu_mul_quant( 2025-05-07T20:32:18.5234266Z self, 2025-05-07T20:32:18.5234473Z T: int, 2025-05-07T20:32:18.5234681Z D: int, 2025-05-07T20:32:18.5234903Z scale_ub: Optional[float], 2025-05-07T20:32:18.5235184Z contiguous: bool, 2025-05-07T20:32:18.5235438Z compiled: bool, 2025-05-07T20:32:18.5235664Z ) -> None: 2025-05-07T20:32:18.5235892Z torch.manual_seed(2025) 2025-05-07T20:32:18.5236146Z 2025-05-07T20:32:18.5236419Z > x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:18.5238422Z E torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 80.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 26.44 MiB is free. Including non-PyTorch memory, this process has 22.04 GiB memory in use. Of the allocated memory 21.73 GiB is allocated by PyTorch, and 19.12 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. 
See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables) 2025-05-07T20:32:18.5240240Z 2025-05-07T20:32:18.5240361Z moe/activation_test.py:92: OutOfMemoryError 2025-05-07T20:32:18.5240581Z 2025-05-07T20:32:18.5240688Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:18.5241203Z self=, 2025-05-07T20:32:18.5241606Z T=2048, 2025-05-07T20:32:18.5250232Z D=5120, 2025-05-07T20:32:18.5250470Z scale_ub=None, 2025-05-07T20:32:18.5250720Z contiguous=False, 2025-05-07T20:32:18.5250991Z compiled=False, 2025-05-07T20:32:18.5251204Z ) 2025-05-07T20:32:18.5251526Z self = 2025-05-07T20:32:18.5252030Z T = 2048, D = 5120, scale_ub = None, contiguous = False, compiled = False 2025-05-07T20:32:18.5252383Z 2025-05-07T20:32:18.5252475Z @given( 2025-05-07T20:32:18.5252711Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:18.5253042Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:18.5253356Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:18.5253801Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:18.5254143Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:18.5254440Z ) 2025-05-07T20:32:18.5254795Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:18.5255238Z def test_silu_mul_quant( 2025-05-07T20:32:18.5255486Z self, 2025-05-07T20:32:18.5255688Z T: int, 2025-05-07T20:32:18.5255969Z D: int, 2025-05-07T20:32:18.5256196Z scale_ub: Optional[float], 2025-05-07T20:32:18.5256476Z contiguous: bool, 2025-05-07T20:32:18.5256715Z compiled: bool, 2025-05-07T20:32:18.5256950Z ) -> None: 2025-05-07T20:32:18.5257179Z torch.manual_seed(2025) 2025-05-07T20:32:18.5257418Z 2025-05-07T20:32:18.5257698Z > x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:18.5259719Z E torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 40.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 26.44 MiB is free. Including non-PyTorch memory, this process has 22.04 GiB memory in use. Of the allocated memory 21.73 GiB is allocated by PyTorch, and 19.12 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. 
See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables) 2025-05-07T20:32:18.5261589Z 2025-05-07T20:32:18.5261712Z moe/activation_test.py:92: OutOfMemoryError 2025-05-07T20:32:18.5261925Z 2025-05-07T20:32:18.5262038Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:18.5263859Z self=, 2025-05-07T20:32:18.5264270Z T=4096, 2025-05-07T20:32:18.5264472Z D=7168, 2025-05-07T20:32:18.5264677Z scale_ub=None, 2025-05-07T20:32:18.5264890Z contiguous=True, 2025-05-07T20:32:18.5265123Z compiled=True, 2025-05-07T20:32:18.5265336Z ) 2025-05-07T20:32:18.5265650Z self = 2025-05-07T20:32:18.5266139Z T = 4096, D = 7168, scale_ub = None, contiguous = True, compiled = True 2025-05-07T20:32:18.5266410Z 2025-05-07T20:32:18.5266490Z @given( 2025-05-07T20:32:18.5266728Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:18.5267040Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:18.5267356Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:18.5267688Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:18.5268022Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:18.5268309Z ) 2025-05-07T20:32:18.5268661Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:18.5269108Z def test_silu_mul_quant( 2025-05-07T20:32:18.5269346Z self, 2025-05-07T20:32:18.5269549Z T: int, 2025-05-07T20:32:18.5269753Z D: int, 2025-05-07T20:32:18.5269974Z scale_ub: Optional[float], 2025-05-07T20:32:18.5270248Z contiguous: bool, 2025-05-07T20:32:18.5270545Z compiled: bool, 2025-05-07T20:32:18.5270790Z ) -> None: 2025-05-07T20:32:18.5271032Z torch.manual_seed(2025) 2025-05-07T20:32:18.5271275Z 2025-05-07T20:32:18.5271541Z > x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:18.5273538Z E torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 112.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 26.44 MiB is free. Including non-PyTorch memory, this process has 22.04 GiB memory in use. Of the allocated memory 21.73 GiB is allocated by PyTorch, and 19.12 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. 
See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables) 2025-05-07T20:32:18.5275404Z 2025-05-07T20:32:18.5275523Z moe/activation_test.py:92: OutOfMemoryError 2025-05-07T20:32:18.5275742Z 2025-05-07T20:32:18.5275845Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:18.5276258Z self=, 2025-05-07T20:32:18.5276652Z T=2048, 2025-05-07T20:32:18.5276842Z D=5120, 2025-05-07T20:32:18.5277040Z scale_ub=1200.0, 2025-05-07T20:32:18.5277337Z contiguous=False, 2025-05-07T20:32:18.5277566Z compiled=False, 2025-05-07T20:32:18.5822909Z ) 2025-05-07T20:32:18.5823780Z self = 2025-05-07T20:32:18.5824544Z T = 2048, D = 5120, scale_ub = 1200.0, contiguous = False, compiled = False 2025-05-07T20:32:18.5824918Z 2025-05-07T20:32:18.5825025Z @given( 2025-05-07T20:32:18.5825339Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:18.5825750Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:18.5826071Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:18.5826403Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:18.5826758Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:18.5827052Z ) 2025-05-07T20:32:18.5827400Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:18.5827847Z def test_silu_mul_quant( 2025-05-07T20:32:18.5828107Z self, 2025-05-07T20:32:18.5828297Z T: int, 2025-05-07T20:32:18.5828505Z D: int, 2025-05-07T20:32:18.5828728Z scale_ub: Optional[float], 2025-05-07T20:32:18.5829000Z contiguous: bool, 2025-05-07T20:32:18.5829251Z compiled: bool, 2025-05-07T20:32:18.5829489Z ) -> None: 2025-05-07T20:32:18.5829702Z torch.manual_seed(2025) 2025-05-07T20:32:18.5829949Z 2025-05-07T20:32:18.5830227Z > x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:18.5832252Z E torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 40.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 26.44 MiB is free. Including non-PyTorch memory, this process has 22.04 GiB memory in use. Of the allocated memory 21.73 GiB is allocated by PyTorch, and 19.12 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. 
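Every OOM record carries the same allocator hint. Note that PYTORCH_CUDA_ALLOC_CONF only takes effect if it is set before the first CUDA allocation, so it belongs in the job environment or at the very top of the test process; a minimal sketch, assuming a Python entry point:

import os

# Must be set before torch touches the CUDA caching allocator.
os.environ.setdefault("PYTORCH_CUDA_ALLOC_CONF", "expandable_segments:True")

import torch  # imported afterwards so the allocator sees the setting

Here, though, the reserved-but-unallocated figures are tiny (roughly 4-60 MiB), so fragmentation is unlikely to be the real problem.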
See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables) 2025-05-07T20:32:18.5834083Z 2025-05-07T20:32:18.5834209Z moe/activation_test.py:92: OutOfMemoryError 2025-05-07T20:32:18.5834420Z 2025-05-07T20:32:18.5834526Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:18.5834941Z self=, 2025-05-07T20:32:18.5835343Z T=4096, 2025-05-07T20:32:18.5835528Z D=7168, 2025-05-07T20:32:18.5835728Z scale_ub=1200.0, 2025-05-07T20:32:18.5835957Z contiguous=True, 2025-05-07T20:32:18.5836176Z compiled=False, 2025-05-07T20:32:18.5836392Z ) 2025-05-07T20:32:18.5836712Z self = 2025-05-07T20:32:18.5837513Z T = 4096, D = 7168, scale_ub = 1200.0, contiguous = True, compiled = False 2025-05-07T20:32:18.5837784Z 2025-05-07T20:32:18.5837865Z @given( 2025-05-07T20:32:18.5838107Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:18.5838427Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:18.5838726Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:18.5839059Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:18.5839487Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:18.5839761Z ) 2025-05-07T20:32:18.5840110Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:18.5840549Z def test_silu_mul_quant( 2025-05-07T20:32:18.5840791Z self, 2025-05-07T20:32:18.5840980Z T: int, 2025-05-07T20:32:18.5841180Z D: int, 2025-05-07T20:32:18.5841400Z scale_ub: Optional[float], 2025-05-07T20:32:18.5841671Z contiguous: bool, 2025-05-07T20:32:18.5841913Z compiled: bool, 2025-05-07T20:32:18.5842141Z ) -> None: 2025-05-07T20:32:18.5842350Z torch.manual_seed(2025) 2025-05-07T20:32:18.5842594Z 2025-05-07T20:32:18.5843019Z > x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:18.5845015Z E torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 112.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 26.44 MiB is free. Including non-PyTorch memory, this process has 22.04 GiB memory in use. Of the allocated memory 21.73 GiB is allocated by PyTorch, and 19.12 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. 
See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables) 2025-05-07T20:32:18.5846837Z 2025-05-07T20:32:18.5846959Z moe/activation_test.py:92: OutOfMemoryError 2025-05-07T20:32:18.5847177Z 2025-05-07T20:32:18.5847279Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:18.5847694Z self=, 2025-05-07T20:32:18.5848096Z T=16384, 2025-05-07T20:32:18.5848291Z D=7168, 2025-05-07T20:32:18.5848526Z scale_ub=None, 2025-05-07T20:32:18.5848740Z contiguous=False, 2025-05-07T20:32:18.5848969Z compiled=True, 2025-05-07T20:32:18.5849176Z ) 2025-05-07T20:32:18.5849488Z self = 2025-05-07T20:32:18.5849985Z T = 16384, D = 7168, scale_ub = None, contiguous = False, compiled = True 2025-05-07T20:32:18.5850273Z 2025-05-07T20:32:18.5850370Z @given( 2025-05-07T20:32:18.5850624Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:18.5850941Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:18.5851251Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:18.5851578Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:18.5851910Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:18.5852201Z ) 2025-05-07T20:32:18.5852551Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:18.5852992Z def test_silu_mul_quant( 2025-05-07T20:32:18.5853240Z self, 2025-05-07T20:32:18.5853441Z T: int, 2025-05-07T20:32:18.5853633Z D: int, 2025-05-07T20:32:18.5853969Z scale_ub: Optional[float], 2025-05-07T20:32:18.5854248Z contiguous: bool, 2025-05-07T20:32:18.5854480Z compiled: bool, 2025-05-07T20:32:18.5854704Z ) -> None: 2025-05-07T20:32:18.5854917Z torch.manual_seed(2025) 2025-05-07T20:32:18.5855154Z 2025-05-07T20:32:18.5855426Z > x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:18.5857423Z E torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 448.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 26.44 MiB is free. Including non-PyTorch memory, this process has 22.04 GiB memory in use. Of the allocated memory 21.73 GiB is allocated by PyTorch, and 19.12 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. 
See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables) 2025-05-07T20:32:18.5859286Z 2025-05-07T20:32:18.5859443Z moe/activation_test.py:92: OutOfMemoryError 2025-05-07T20:32:18.5859655Z 2025-05-07T20:32:18.5859765Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:18.5860178Z self=, 2025-05-07T20:32:18.5860617Z T=4096, 2025-05-07T20:32:18.5860809Z D=7168, 2025-05-07T20:32:18.5860998Z scale_ub=None, 2025-05-07T20:32:18.5861220Z contiguous=True, 2025-05-07T20:32:18.5861446Z compiled=False, 2025-05-07T20:32:18.5861651Z ) 2025-05-07T20:32:18.5861976Z self = 2025-05-07T20:32:18.5862466Z T = 4096, D = 7168, scale_ub = None, contiguous = True, compiled = False 2025-05-07T20:32:18.5862732Z 2025-05-07T20:32:18.5862898Z @given( 2025-05-07T20:32:18.5863129Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:18.5863451Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:18.5863761Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:18.5864090Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:18.5864426Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:18.5864730Z ) 2025-05-07T20:32:18.5865075Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:18.5865526Z def test_silu_mul_quant( 2025-05-07T20:32:18.5865774Z self, 2025-05-07T20:32:18.5865965Z T: int, 2025-05-07T20:32:18.5866173Z D: int, 2025-05-07T20:32:18.5866402Z scale_ub: Optional[float], 2025-05-07T20:32:18.5866673Z contiguous: bool, 2025-05-07T20:32:18.5866925Z compiled: bool, 2025-05-07T20:32:18.5867156Z ) -> None: 2025-05-07T20:32:18.5867388Z torch.manual_seed(2025) 2025-05-07T20:32:18.5867649Z 2025-05-07T20:32:18.5867919Z > x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:18.5869917Z E torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 112.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 26.44 MiB is free. Including non-PyTorch memory, this process has 22.04 GiB memory in use. Of the allocated memory 21.73 GiB is allocated by PyTorch, and 19.12 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. 
See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables) 2025-05-07T20:32:18.5871790Z 2025-05-07T20:32:18.5871910Z moe/activation_test.py:92: OutOfMemoryError 2025-05-07T20:32:18.5872131Z 2025-05-07T20:32:18.5872238Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:18.5872663Z self=, 2025-05-07T20:32:18.5873057Z T=16384, 2025-05-07T20:32:18.5873258Z D=7168, 2025-05-07T20:32:18.5873454Z scale_ub=None, 2025-05-07T20:32:18.5873673Z contiguous=True, 2025-05-07T20:32:18.5873905Z compiled=False, 2025-05-07T20:32:18.5874117Z ) 2025-05-07T20:32:18.5874431Z self = 2025-05-07T20:32:18.5874932Z T = 16384, D = 7168, scale_ub = None, contiguous = True, compiled = False 2025-05-07T20:32:18.5875214Z 2025-05-07T20:32:18.5875293Z @given( 2025-05-07T20:32:18.5875531Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:18.5875841Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:18.5876243Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:18.5876578Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:18.5876903Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:18.5877198Z ) 2025-05-07T20:32:18.5877573Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:18.5878189Z def test_silu_mul_quant( 2025-05-07T20:32:18.5878494Z self, 2025-05-07T20:32:18.5878698Z T: int, 2025-05-07T20:32:18.5878987Z D: int, 2025-05-07T20:32:18.5879207Z scale_ub: Optional[float], 2025-05-07T20:32:18.5879491Z contiguous: bool, 2025-05-07T20:32:18.5879735Z compiled: bool, 2025-05-07T20:32:18.5879962Z ) -> None: 2025-05-07T20:32:18.5880183Z torch.manual_seed(2025) 2025-05-07T20:32:18.5880427Z 2025-05-07T20:32:18.5880695Z > x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:18.5882807Z E torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 448.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 26.44 MiB is free. Including non-PyTorch memory, this process has 22.04 GiB memory in use. Of the allocated memory 21.73 GiB is allocated by PyTorch, and 19.12 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. 
See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables) 2025-05-07T20:32:18.5884656Z 2025-05-07T20:32:18.5884773Z moe/activation_test.py:92: OutOfMemoryError 2025-05-07T20:32:18.5884983Z 2025-05-07T20:32:18.5885095Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:18.5885508Z self=, 2025-05-07T20:32:18.5885907Z T=16384, 2025-05-07T20:32:18.5886105Z D=7168, 2025-05-07T20:32:18.5886303Z scale_ub=1200.0, 2025-05-07T20:32:18.5886524Z contiguous=True, 2025-05-07T20:32:18.5886753Z compiled=False, 2025-05-07T20:32:18.5886966Z ) 2025-05-07T20:32:18.5887283Z self = 2025-05-07T20:32:18.5887787Z T = 16384, D = 7168, scale_ub = 1200.0, contiguous = True, compiled = False 2025-05-07T20:32:18.5888063Z 2025-05-07T20:32:18.5888150Z @given( 2025-05-07T20:32:18.5888382Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:18.5888704Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:18.5889130Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:18.5889573Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:18.5889902Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:18.5890218Z ) 2025-05-07T20:32:18.5890599Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:18.5891037Z def test_silu_mul_quant( 2025-05-07T20:32:18.5891293Z self, 2025-05-07T20:32:18.5891493Z T: int, 2025-05-07T20:32:18.5891689Z D: int, 2025-05-07T20:32:18.5891916Z scale_ub: Optional[float], 2025-05-07T20:32:18.5892194Z contiguous: bool, 2025-05-07T20:32:18.5892431Z compiled: bool, 2025-05-07T20:32:18.5892664Z ) -> None: 2025-05-07T20:32:18.5892887Z torch.manual_seed(2025) 2025-05-07T20:32:18.5893133Z 2025-05-07T20:32:18.5893409Z > x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:18.5895558Z E torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 448.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 26.44 MiB is free. Including non-PyTorch memory, this process has 22.04 GiB memory in use. Of the allocated memory 21.73 GiB is allocated by PyTorch, and 19.12 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. 
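Consistent with that, the "allocated by PyTorch" figure creeps upward across examples (21.69 GiB, then 21.73 GiB, later 21.77 GiB): tensors from earlier Hypothesis examples are apparently never released, so each new [T, 2*D] allocation finds the pool already full. A sketch of an explicit per-example cleanup (hypothetical; free_cuda_pool is an assumed helper, not in the original test):

import gc
import torch

def free_cuda_pool() -> None:
    gc.collect()               # drop lingering Python references first
    torch.cuda.synchronize()   # let in-flight kernels finish
    torch.cuda.empty_cache()   # hand cached blocks back to the driver

Calling this at the top of test_silu_mul_quant would give each generated (T, D) case a clean allocator pool.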
See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables) 2025-05-07T20:32:18.5897463Z 2025-05-07T20:32:18.5897582Z moe/activation_test.py:92: OutOfMemoryError 2025-05-07T20:32:18.7706888Z 2025-05-07T20:32:18.7707631Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:18.7708316Z self=, 2025-05-07T20:32:18.7708849Z T=128, 2025-05-07T20:32:18.7709042Z D=5120, 2025-05-07T20:32:18.7709250Z scale_ub=1200.0, 2025-05-07T20:32:18.7709781Z contiguous=False, 2025-05-07T20:32:18.7710014Z compiled=False, 2025-05-07T20:32:18.7710225Z ) 2025-05-07T20:32:18.7710546Z self = 2025-05-07T20:32:18.7711044Z T = 128, D = 5120, scale_ub = 1200.0, contiguous = False, compiled = False 2025-05-07T20:32:18.7711316Z 2025-05-07T20:32:18.7711400Z @given( 2025-05-07T20:32:18.7711640Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:18.7711967Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:18.7712271Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:18.7712605Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:18.7713093Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:18.7713380Z ) 2025-05-07T20:32:18.7713733Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:18.7714188Z def test_silu_mul_quant( 2025-05-07T20:32:18.7714450Z self, 2025-05-07T20:32:18.7714648Z T: int, 2025-05-07T20:32:18.7714858Z D: int, 2025-05-07T20:32:18.7715083Z scale_ub: Optional[float], 2025-05-07T20:32:18.7715356Z contiguous: bool, 2025-05-07T20:32:18.7715602Z compiled: bool, 2025-05-07T20:32:18.7715845Z ) -> None: 2025-05-07T20:32:18.7716063Z torch.manual_seed(2025) 2025-05-07T20:32:18.7716319Z 2025-05-07T20:32:18.7716602Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:18.7716943Z 2025-05-07T20:32:18.7717146Z x_sign = torch.sign(x) 2025-05-07T20:32:18.7717454Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:18.7717765Z x = x_sign * x_clamp 2025-05-07T20:32:18.7718025Z x0 = x[:, :D] 2025-05-07T20:32:18.7718261Z x1 = x[:, D:] 2025-05-07T20:32:18.7718472Z 2025-05-07T20:32:18.7718669Z if contiguous: 2025-05-07T20:32:18.7718907Z x0 = x0.contiguous() 2025-05-07T20:32:18.7719178Z x1 = x1.contiguous() 2025-05-07T20:32:18.7719418Z 2025-05-07T20:32:18.7719620Z if scale_ub is not None: 2025-05-07T20:32:18.7719902Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:18.7720251Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:18.7720603Z ) 2025-05-07T20:32:18.7720806Z else: 2025-05-07T20:32:18.7721019Z scale_ub_tensor = None 2025-05-07T20:32:18.7721282Z 2025-05-07T20:32:18.7721519Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:18.7721831Z op = silu_mul_quant 2025-05-07T20:32:18.7722085Z if compiled: 2025-05-07T20:32:18.7722341Z op = torch.compile(op) 2025-05-07T20:32:18.7722641Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:18.7722926Z 2025-05-07T20:32:18.7723119Z > y_fp8, y_scale = fn() 2025-05-07T20:32:18.7723282Z 2025-05-07T20:32:18.7723388Z moe/activation_test.py:117: 2025-05-07T20:32:18.7723684Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:18.7724022Z moe/activation_test.py:115: in fn 2025-05-07T20:32:18.7724311Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:18.7724997Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:32:18.7725688Z 
_fbgemm_silu_mul_quant[grid]( 2025-05-07T20:32:18.7726329Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:18.7727011Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:18.7727672Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:18.7728208Z kernel = self.compile( 2025-05-07T20:32:18.7728750Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:18.7729444Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:18.7729848Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:18.7730086Z 2025-05-07T20:32:18.7730296Z self = 2025-05-07T20:32:18.7731373Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:18.7732826Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7fc96fe8cae0>} 2025-05-07T20:32:18.7734296Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:18.7735317Z context = 2025-05-07T20:32:18.7735601Z 2025-05-07T20:32:18.7735775Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:18.7736301Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:18.7736763Z module_map=module_map) 2025-05-07T20:32:18.7737141Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:18.7737504Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:32:18.7737766Z E ^ 2025-05-07T20:32:18.7738236Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:18.7738681Z 2025-05-07T20:32:18.7739101Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:18.7739612Z 2025-05-07T20:32:18.7739725Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:18.7740147Z self=, 2025-05-07T20:32:18.7740593Z T=2048, 2025-05-07T20:32:18.7740795Z D=7168, 2025-05-07T20:32:18.7740989Z scale_ub=None, 2025-05-07T20:32:18.7741212Z contiguous=False, 2025-05-07T20:32:18.7741445Z compiled=False, 2025-05-07T20:32:18.7741650Z ) 2025-05-07T20:32:18.7741974Z self = 2025-05-07T20:32:18.7742469Z T = 2048, D = 7168, scale_ub = None, contiguous = False, compiled = False 2025-05-07T20:32:18.7742740Z 2025-05-07T20:32:18.7742827Z @given( 2025-05-07T20:32:18.7743064Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:18.7743379Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:18.7743687Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:18.7744013Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:18.7744348Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:18.7744636Z ) 2025-05-07T20:32:18.7744978Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:18.7745420Z def test_silu_mul_quant( 2025-05-07T20:32:18.7745666Z self, 2025-05-07T20:32:18.7745860Z T: int, 2025-05-07T20:32:18.7746064Z D: int, 2025-05-07T20:32:18.7746338Z scale_ub: Optional[float], 2025-05-07T20:32:18.7746608Z contiguous: bool, 2025-05-07T20:32:18.7746853Z compiled: bool, 2025-05-07T20:32:18.7747080Z ) -> None: 2025-05-07T20:32:18.7747296Z torch.manual_seed(2025) 2025-05-07T20:32:18.7747545Z 2025-05-07T20:32:18.7747827Z > x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:18.7749844Z E torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 56.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 26.44 MiB is free. Including non-PyTorch memory, this process has 22.04 GiB memory in use. Of the allocated memory 21.74 GiB is allocated by PyTorch, and 10.99 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. 
See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables) 2025-05-07T20:32:18.7751751Z 2025-05-07T20:32:18.7751878Z moe/activation_test.py:92: OutOfMemoryError 2025-05-07T20:32:18.7752090Z 2025-05-07T20:32:18.7752195Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:18.7752609Z self=, 2025-05-07T20:32:18.7753094Z T=128, 2025-05-07T20:32:18.7753279Z D=7168, 2025-05-07T20:32:18.7753482Z scale_ub=1200.0, 2025-05-07T20:32:18.7753709Z contiguous=True, 2025-05-07T20:32:18.7753936Z compiled=True, 2025-05-07T20:32:18.7754141Z ) 2025-05-07T20:32:18.7754480Z self = 2025-05-07T20:32:18.7754970Z T = 128, D = 7168, scale_ub = 1200.0, contiguous = True, compiled = True 2025-05-07T20:32:18.7755236Z 2025-05-07T20:32:18.7755316Z @given( 2025-05-07T20:32:18.7755555Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:18.7755879Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:18.7756179Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:18.7756515Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:18.7756847Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:18.7757141Z ) 2025-05-07T20:32:18.7766114Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:18.7766606Z def test_silu_mul_quant( 2025-05-07T20:32:18.7766861Z self, 2025-05-07T20:32:18.7767062Z T: int, 2025-05-07T20:32:18.7767273Z D: int, 2025-05-07T20:32:18.7767508Z scale_ub: Optional[float], 2025-05-07T20:32:18.7767785Z contiguous: bool, 2025-05-07T20:32:18.7768040Z compiled: bool, 2025-05-07T20:32:18.7768279Z ) -> None: 2025-05-07T20:32:18.7768500Z torch.manual_seed(2025) 2025-05-07T20:32:18.7768758Z 2025-05-07T20:32:18.7769046Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:18.7769394Z 2025-05-07T20:32:18.7769603Z x_sign = torch.sign(x) 2025-05-07T20:32:18.7769913Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:18.7770266Z x = x_sign * x_clamp 2025-05-07T20:32:18.7770528Z x0 = x[:, :D] 2025-05-07T20:32:18.7770762Z x1 = x[:, D:] 2025-05-07T20:32:18.7770981Z 2025-05-07T20:32:18.7771177Z if contiguous: 2025-05-07T20:32:18.7771422Z x0 = x0.contiguous() 2025-05-07T20:32:18.7771693Z x1 = x1.contiguous() 2025-05-07T20:32:18.7771935Z 2025-05-07T20:32:18.7772143Z if scale_ub is not None: 2025-05-07T20:32:18.7772434Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:18.7772773Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:18.7773097Z ) 2025-05-07T20:32:18.7773309Z else: 2025-05-07T20:32:18.7773525Z scale_ub_tensor = None 2025-05-07T20:32:18.7773891Z 2025-05-07T20:32:18.7774134Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:18.7774530Z op = silu_mul_quant 2025-05-07T20:32:18.7774785Z if compiled: 2025-05-07T20:32:18.7775032Z op = torch.compile(op) 2025-05-07T20:32:18.7775329Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:18.7775615Z 2025-05-07T20:32:18.7775820Z > y_fp8, y_scale = fn() 2025-05-07T20:32:18.7775985Z 2025-05-07T20:32:18.7776087Z moe/activation_test.py:117: 2025-05-07T20:32:18.7776388Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:18.7776776Z moe/activation_test.py:115: in fn 2025-05-07T20:32:18.7777065Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:18.7777621Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/torch/_dynamo/eval_frame.py:678: in _fn 2025-05-07T20:32:18.7778188Z return fn(*args, **kwargs) 
2025-05-07T20:32:18.7778850Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:32:18.7779530Z _fbgemm_silu_mul_quant[grid]( 2025-05-07T20:32:18.7780072Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:18.7780861Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:18.7781531Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:18.7782057Z kernel = self.compile( 2025-05-07T20:32:18.7782604Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:18.7783262Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:18.7783657Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:18.7783894Z 2025-05-07T20:32:18.7784102Z self = 2025-05-07T20:32:18.7785175Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:18.7786537Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7fc96fd10040>} 2025-05-07T20:32:18.7787873Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:18.7788875Z context = 2025-05-07T20:32:18.7789165Z 2025-05-07T20:32:18.7789331Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:18.7789849Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:18.7790323Z module_map=module_map) 2025-05-07T20:32:18.7790686Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:18.7791044Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:32:18.7791308Z E ^ 2025-05-07T20:32:18.7791771Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:18.7792220Z 2025-05-07T20:32:18.7792632Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:19.0865111Z 2025-05-07T20:32:19.0865491Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:19.0866126Z self=, 2025-05-07T20:32:19.0866681Z T=128, 2025-05-07T20:32:19.0866941Z D=7168, 2025-05-07T20:32:19.0867202Z scale_ub=1200.0, 2025-05-07T20:32:19.0867496Z contiguous=True, 2025-05-07T20:32:19.0868105Z compiled=False, 2025-05-07T20:32:19.0868370Z ) 2025-05-07T20:32:19.0868759Z self = 2025-05-07T20:32:19.0869256Z T = 128, D = 7168, scale_ub = 1200.0, contiguous = True, compiled = False 2025-05-07T20:32:19.0869526Z 2025-05-07T20:32:19.0869623Z @given( 2025-05-07T20:32:19.0869865Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:19.0870182Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:19.0870489Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:19.0870990Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:19.0871317Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:19.0871606Z ) 2025-05-07T20:32:19.0871949Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:19.0872397Z def test_silu_mul_quant( 2025-05-07T20:32:19.0872642Z self, 2025-05-07T20:32:19.0872841Z T: int, 2025-05-07T20:32:19.0873043Z D: int, 2025-05-07T20:32:19.0873271Z scale_ub: Optional[float], 2025-05-07T20:32:19.0873543Z contiguous: bool, 2025-05-07T20:32:19.0873786Z compiled: bool, 2025-05-07T20:32:19.0874018Z ) -> None: 2025-05-07T20:32:19.0874394Z torch.manual_seed(2025) 2025-05-07T20:32:19.0874641Z 2025-05-07T20:32:19.0874918Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:19.0875265Z 2025-05-07T20:32:19.0875460Z x_sign = torch.sign(x) 2025-05-07T20:32:19.0875755Z > x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:19.0877727Z E torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 20.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 4.44 MiB is free. Including non-PyTorch memory, this process has 22.06 GiB memory in use. Of the allocated memory 21.77 GiB is allocated by PyTorch, and 6.37 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. 
See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables) 2025-05-07T20:32:19.0879554Z 2025-05-07T20:32:19.0879684Z moe/activation_test.py:95: OutOfMemoryError 2025-05-07T20:32:19.0879898Z 2025-05-07T20:32:19.0880002Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:19.0880434Z self=, 2025-05-07T20:32:19.0880872Z T=128, 2025-05-07T20:32:19.0881065Z D=5120, 2025-05-07T20:32:19.0881260Z scale_ub=1200.0, 2025-05-07T20:32:19.0881487Z contiguous=True, 2025-05-07T20:32:19.0881713Z compiled=True, 2025-05-07T20:32:19.0881913Z ) 2025-05-07T20:32:19.0882247Z self = 2025-05-07T20:32:19.0882731Z T = 128, D = 5120, scale_ub = 1200.0, contiguous = True, compiled = True 2025-05-07T20:32:19.0882996Z 2025-05-07T20:32:19.0883084Z @given( 2025-05-07T20:32:19.0883321Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:19.0883639Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:19.0883947Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:19.0884281Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:19.0884612Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:19.0884901Z ) 2025-05-07T20:32:19.0885245Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:19.0885692Z def test_silu_mul_quant( 2025-05-07T20:32:19.0885939Z self, 2025-05-07T20:32:19.0886137Z T: int, 2025-05-07T20:32:19.0886340Z D: int, 2025-05-07T20:32:19.0886565Z scale_ub: Optional[float], 2025-05-07T20:32:19.0886842Z contiguous: bool, 2025-05-07T20:32:19.0887084Z compiled: bool, 2025-05-07T20:32:19.0887313Z ) -> None: 2025-05-07T20:32:19.0887585Z torch.manual_seed(2025) 2025-05-07T20:32:19.0887822Z 2025-05-07T20:32:19.0888094Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:19.0888437Z 2025-05-07T20:32:19.0888628Z x_sign = torch.sign(x) 2025-05-07T20:32:19.0888923Z > x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:19.0890865Z E torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 20.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 4.44 MiB is free. Including non-PyTorch memory, this process has 22.06 GiB memory in use. Of the allocated memory 21.77 GiB is allocated by PyTorch, and 3.87 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. 
See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables) 2025-05-07T20:32:19.0892706Z 2025-05-07T20:32:19.0892832Z moe/activation_test.py:95: OutOfMemoryError 2025-05-07T20:32:19.0893044Z 2025-05-07T20:32:19.0893151Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:19.0893551Z self=, 2025-05-07T20:32:19.0894110Z T=128, 2025-05-07T20:32:19.0894379Z D=7168, 2025-05-07T20:32:19.0894584Z scale_ub=None, 2025-05-07T20:32:19.0894802Z contiguous=True, 2025-05-07T20:32:19.0895029Z compiled=True, 2025-05-07T20:32:19.0895226Z ) 2025-05-07T20:32:19.0895547Z self = 2025-05-07T20:32:19.0896040Z T = 128, D = 7168, scale_ub = None, contiguous = True, compiled = True 2025-05-07T20:32:19.0896303Z 2025-05-07T20:32:19.0896384Z @given( 2025-05-07T20:32:19.0896617Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:19.0896933Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:19.0897239Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:19.0897574Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:19.0897906Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:19.0898537Z ) 2025-05-07T20:32:19.0898917Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:19.0899369Z def test_silu_mul_quant( 2025-05-07T20:32:19.0899612Z self, 2025-05-07T20:32:19.0899805Z T: int, 2025-05-07T20:32:19.0900008Z D: int, 2025-05-07T20:32:19.0900232Z scale_ub: Optional[float], 2025-05-07T20:32:19.0900529Z contiguous: bool, 2025-05-07T20:32:19.0900798Z compiled: bool, 2025-05-07T20:32:19.0901027Z ) -> None: 2025-05-07T20:32:19.0901241Z torch.manual_seed(2025) 2025-05-07T20:32:19.0901484Z 2025-05-07T20:32:19.0901763Z > x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:19.0903760Z E torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 20.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 4.44 MiB is free. Including non-PyTorch memory, this process has 22.06 GiB memory in use. Of the allocated memory 21.77 GiB is allocated by PyTorch, and 3.87 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. 
See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables) 2025-05-07T20:32:19.0905557Z 2025-05-07T20:32:19.0905682Z moe/activation_test.py:92: OutOfMemoryError 2025-05-07T20:32:19.0905894Z 2025-05-07T20:32:19.0912945Z FAILED 2025-05-07T20:32:19.0913105Z 2025-05-07T20:32:19.0913310Z =================================== FAILURES =================================== 2025-05-07T20:32:19.0913917Z _____________________ ActivationTests.test_silu_mul_quant ______________________ 2025-05-07T20:32:19.0914546Z + Exception Group Traceback (most recent call last): 2025-05-07T20:32:19.0915438Z | File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.13/unittest/case.py", line 58, in testPartExecutor 2025-05-07T20:32:19.0916924Z | yield 2025-05-07T20:32:19.0917508Z | File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.13/unittest/case.py", line 651, in run 2025-05-07T20:32:19.0918205Z | self._callTestMethod(testMethod) 2025-05-07T20:32:19.0918596Z | ~~~~~~~~~~~~~~~~~~~~^^^^^^^^^^^^ 2025-05-07T20:32:19.0919340Z | File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.13/unittest/case.py", line 606, in _callTestMethod 2025-05-07T20:32:19.0920193Z | if method() is not None: 2025-05-07T20:32:19.0920588Z | ~~~~~~^^ 2025-05-07T20:32:19.0921449Z | File "/home/ec2-user/actions-runner/_work/FBGEMM/FBGEMM/fbgemm_gpu/experimental/gen_ai/test/moe/activation_test.py", line 75, in test_silu_mul_quant 2025-05-07T20:32:19.0922427Z | T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:19.0922838Z | ^^^^^^^ 2025-05-07T20:32:19.0923599Z | File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/hypothesis/core.py", line 1850, in wrapped_test 2025-05-07T20:32:19.0924456Z | raise the_error_hypothesis_found 2025-05-07T20:32:19.0925026Z | ExceptionGroup: Hypothesis found 4 distinct failures. (4 sub-exceptions) 2025-05-07T20:32:19.0925736Z +-+---------------- 1 ---------------- 2025-05-07T20:32:19.0926139Z | Traceback (most recent call last): 2025-05-07T20:32:19.0927098Z | File "/home/ec2-user/actions-runner/_work/FBGEMM/FBGEMM/fbgemm_gpu/experimental/gen_ai/test/moe/activation_test.py", line 92, in test_silu_mul_quant 2025-05-07T20:32:19.0928164Z | x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:19.0931019Z | torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 40.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 4.44 MiB is free. Including non-PyTorch memory, this process has 22.06 GiB memory in use. Of the allocated memory 21.77 GiB is allocated by PyTorch, and 3.87 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. 
See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables) 2025-05-07T20:32:19.0933825Z | Falsifying example: test_silu_mul_quant( 2025-05-07T20:32:19.0934422Z | self=, 2025-05-07T20:32:19.0934975Z | T=2048, 2025-05-07T20:32:19.0935295Z | D=5120, # or any other generated value 2025-05-07T20:32:19.0935746Z | scale_ub=None, # or any other generated value 2025-05-07T20:32:19.0936237Z | contiguous=True, # or any other generated value 2025-05-07T20:32:19.0936729Z | compiled=False, # or any other generated value 2025-05-07T20:32:19.0937143Z | ) 2025-05-07T20:32:19.0937390Z | 2025-05-07T20:32:19.0938108Z | You can reproduce this example by temporarily adding @reproduce_failure('6.131.14', b'AEECQQBBAEEAQQE=') as a decorator on your test case 2025-05-07T20:32:19.0938936Z +---------------- 2 ---------------- 2025-05-07T20:32:19.0939325Z | Traceback (most recent call last): 2025-05-07T20:32:19.0940323Z | File "/home/ec2-user/actions-runner/_work/FBGEMM/FBGEMM/fbgemm_gpu/experimental/gen_ai/test/moe/activation_test.py", line 92, in test_silu_mul_quant 2025-05-07T20:32:19.0941470Z | x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:19.0944275Z | torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 20.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 4.44 MiB is free. Including non-PyTorch memory, this process has 22.06 GiB memory in use. Of the allocated memory 21.77 GiB is allocated by PyTorch, and 3.87 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables) 2025-05-07T20:32:19.0947033Z | Falsifying example: test_silu_mul_quant( 2025-05-07T20:32:19.0947583Z | self=, 2025-05-07T20:32:19.0947999Z | T=128, 2025-05-07T20:32:19.0948209Z | D=7168, 2025-05-07T20:32:19.0948418Z | scale_ub=None, 2025-05-07T20:32:19.0948666Z | contiguous=True, 2025-05-07T20:32:19.0948974Z | compiled=True, 2025-05-07T20:32:19.0949196Z | ) 2025-05-07T20:32:19.0949382Z | 2025-05-07T20:32:19.0949907Z | You can reproduce this example by temporarily adding @reproduce_failure('6.131.14', b'AEEBQQFBAEEAQQA=') as a decorator on your test case 2025-05-07T20:32:19.0950511Z +---------------- 3 ---------------- 2025-05-07T20:32:19.0950803Z | Traceback (most recent call last): 2025-05-07T20:32:19.0951510Z | File "/home/ec2-user/actions-runner/_work/FBGEMM/FBGEMM/fbgemm_gpu/experimental/gen_ai/test/moe/activation_test.py", line 92, in test_silu_mul_quant 2025-05-07T20:32:19.0952358Z | x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:19.0954469Z | torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 20.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 4.44 MiB is free. Including non-PyTorch memory, this process has 22.06 GiB memory in use. Of the allocated memory 21.77 GiB is allocated by PyTorch, and 3.87 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. 
See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables) 2025-05-07T20:32:19.0956393Z | Falsifying example: test_silu_mul_quant( 2025-05-07T20:32:19.0956826Z | self=, 2025-05-07T20:32:19.0957239Z | T=128, 2025-05-07T20:32:19.0957444Z | D=5120, 2025-05-07T20:32:19.0957655Z | scale_ub=1200.0, 2025-05-07T20:32:19.0957905Z | contiguous=True, 2025-05-07T20:32:19.0958151Z | compiled=True, 2025-05-07T20:32:19.0958375Z | ) 2025-05-07T20:32:19.0958567Z | 2025-05-07T20:32:19.0959088Z | You can reproduce this example by temporarily adding @reproduce_failure('6.131.14', b'AEEBQQBBAUEAQQA=') as a decorator on your test case 2025-05-07T20:32:19.0959687Z +---------------- 4 ---------------- 2025-05-07T20:32:19.0959986Z | Traceback (most recent call last): 2025-05-07T20:32:19.0960691Z | File "/home/ec2-user/actions-runner/_work/FBGEMM/FBGEMM/fbgemm_gpu/experimental/gen_ai/test/moe/activation_test.py", line 126, in test_silu_mul_quant 2025-05-07T20:32:19.0961405Z | y_fp8_ref, y_scale_ref = ref_fn() 2025-05-07T20:32:19.0961694Z | ~~~~~~^^ 2025-05-07T20:32:19.0962334Z | File "/home/ec2-user/actions-runner/_work/FBGEMM/FBGEMM/fbgemm_gpu/experimental/gen_ai/test/moe/activation_test.py", line 124, in ref_fn 2025-05-07T20:32:19.0963031Z | return triton_quantize_fp8_row(y, scale_ub_tensor) 2025-05-07T20:32:19.0963862Z | File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/fbgemm_gpu/experimental/gemm/triton_gemm/fp8_gemm.py", line 2370, in triton_quantize_fp8_row 2025-05-07T20:32:19.0964644Z | _kernel_quantize_fp8_row[grid]( 2025-05-07T20:32:19.0964942Z | ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~^ 2025-05-07T20:32:19.0965215Z | a, 2025-05-07T20:32:19.0965415Z | ^^ 2025-05-07T20:32:19.0965631Z | ...<23 lines>... 
2025-05-07T20:32:19.0965879Z | USE_INT64=use_int64, 2025-05-07T20:32:19.0966144Z | ^^^^^^^^^^^^^^^^^^^^ 2025-05-07T20:32:19.0966393Z | ) 2025-05-07T20:32:19.0966585Z | ^ 2025-05-07T20:32:19.0967102Z | File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/jit.py", line 330, in <lambda> 2025-05-07T20:32:19.0967892Z | return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:19.0968348Z | ~~~~~~~~^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ 2025-05-07T20:32:19.0969002Z | File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/autotuner.py", line 186, in run 2025-05-07T20:32:19.0969770Z | timings = {config: self._bench(*args, config=config, **kwargs) for config in pruned_configs} 2025-05-07T20:32:19.0970307Z | ~~~~~~~~~~~^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ 2025-05-07T20:32:19.0970984Z | File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/autotuner.py", line 166, in _bench 2025-05-07T20:32:19.0971678Z | return self.do_bench(kernel_call, quantiles=(0.5, 0.2, 0.8)) 2025-05-07T20:32:19.0972058Z | ~~~~~~~~~~~~~^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ 2025-05-07T20:32:19.0972667Z | File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/testing.py", line 117, in do_bench 2025-05-07T20:32:19.0973232Z | fn() 2025-05-07T20:32:19.0973433Z | ~~^^ 2025-05-07T20:32:19.0974241Z | File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/autotuner.py", line 152, in kernel_call 2025-05-07T20:32:19.0974875Z | self.fn.run( 2025-05-07T20:32:19.0975102Z | ~~~~~~~~~~~^ 2025-05-07T20:32:19.0975322Z | *args, 2025-05-07T20:32:19.0975544Z | ^^^^^^ 2025-05-07T20:32:19.0975765Z | **current, 2025-05-07T20:32:19.0975991Z | ^^^^^^^^^^ 2025-05-07T20:32:19.0976220Z | ) 2025-05-07T20:32:19.0976419Z | ^ 2025-05-07T20:32:19.0976906Z | File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/jit.py", line 623, in run 2025-05-07T20:32:19.0977493Z | kernel = self.compile( 2025-05-07T20:32:19.0977756Z | src, 2025-05-07T20:32:19.0977975Z | target=target, 2025-05-07T20:32:19.0978240Z | options=options.__dict__, 2025-05-07T20:32:19.0978516Z | ) 2025-05-07T20:32:19.0979063Z | File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py", line 273, in compile 2025-05-07T20:32:19.0979759Z | module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:19.0980465Z | File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py", line 100, in make_ir 2025-05-07T20:32:19.0981245Z | return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:19.0981711Z | module_map=module_map) 2025-05-07T20:32:19.0982082Z | triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:19.0982439Z | def _kernel_quantize_fp8_row( 2025-05-07T20:32:19.0982712Z | ^ 2025-05-07T20:32:19.0983166Z | ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:19.0983727Z | Falsifying example: test_silu_mul_quant( 2025-05-07T20:32:19.0984137Z | # The test always failed when commented parts were varied together.
2025-05-07T20:32:19.0984647Z | self=, 2025-05-07T20:32:19.0985087Z | T=1, # or any other generated value 2025-05-07T20:32:19.0985409Z | D=5120, # or any other generated value 2025-05-07T20:32:19.0985754Z | scale_ub=None, # or any other generated value 2025-05-07T20:32:19.0986119Z | contiguous=True, # or any other generated value 2025-05-07T20:32:19.0986488Z | compiled=True, # or any other generated value 2025-05-07T20:32:19.0986795Z | ) 2025-05-07T20:32:19.0986976Z | 2025-05-07T20:32:19.0987568Z | You can reproduce this example by temporarily adding @reproduce_failure('6.131.14', b'AEEAQQBBAEEAQQA=') as a decorator on your test case 2025-05-07T20:32:19.0988174Z +------------------------------------ 2025-05-07T20:32:19.0988537Z ---------------------------------- Hypothesis ---------------------------------- 2025-05-07T20:32:19.0988917Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:19.0989332Z self=, 2025-05-07T20:32:19.0989732Z T=1, 2025-05-07T20:32:19.0989964Z D=5120, 2025-05-07T20:32:19.0990162Z scale_ub=None, 2025-05-07T20:32:19.0990428Z contiguous=True, 2025-05-07T20:32:19.0990673Z compiled=True, 2025-05-07T20:32:19.0990888Z ) 2025-05-07T20:32:19.0991214Z self = 2025-05-07T20:32:19.0991697Z T = 1, D = 5120, scale_ub = None, contiguous = True, compiled = True 2025-05-07T20:32:19.0991963Z 2025-05-07T20:32:19.0992049Z @given( 2025-05-07T20:32:19.0992292Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:19.0992608Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:19.0992920Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:19.0993346Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:19.0993780Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:19.0994145Z ) 2025-05-07T20:32:19.1015966Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:19.1016673Z def test_silu_mul_quant( 2025-05-07T20:32:19.1017017Z self, 2025-05-07T20:32:19.1017289Z T: int, 2025-05-07T20:32:19.1017559Z D: int, 2025-05-07T20:32:19.1017854Z scale_ub: Optional[float], 2025-05-07T20:32:19.1018242Z contiguous: bool, 2025-05-07T20:32:19.1018565Z compiled: bool, 2025-05-07T20:32:19.1018869Z ) -> None: 2025-05-07T20:32:19.1019155Z torch.manual_seed(2025) 2025-05-07T20:32:19.1019478Z 2025-05-07T20:32:19.1019848Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:19.1020326Z 2025-05-07T20:32:19.1020596Z x_sign = torch.sign(x) 2025-05-07T20:32:19.1020994Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:19.1021423Z x = x_sign * x_clamp 2025-05-07T20:32:19.1021757Z x0 = x[:, :D] 2025-05-07T20:32:19.1022077Z x1 = x[:, D:] 2025-05-07T20:32:19.1022377Z 2025-05-07T20:32:19.1022631Z if contiguous: 2025-05-07T20:32:19.1022961Z x0 = x0.contiguous() 2025-05-07T20:32:19.1023320Z x1 = x1.contiguous() 2025-05-07T20:32:19.1023658Z 2025-05-07T20:32:19.1023918Z if scale_ub is not None: 2025-05-07T20:32:19.1024295Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:19.1024757Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:19.1025174Z ) 2025-05-07T20:32:19.1025446Z else: 2025-05-07T20:32:19.1025745Z scale_ub_tensor = None 2025-05-07T20:32:19.1026085Z 2025-05-07T20:32:19.1026406Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:19.1026844Z op = silu_mul_quant 2025-05-07T20:32:19.1027186Z if compiled: 2025-05-07T20:32:19.1027541Z op = torch.compile(op) 2025-05-07T20:32:19.1027954Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:19.1028328Z 2025-05-07T20:32:19.1028596Z 
y_fp8, y_scale = fn() 2025-05-07T20:32:19.1028991Z y = y_fp8.to(torch.float32) * y_scale[:, None] 2025-05-07T20:32:19.1029388Z 2025-05-07T20:32:19.1029717Z def ref_fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:19.1030168Z x0_fp32 = x0.to(torch.float32) 2025-05-07T20:32:19.1030548Z x1_fp32 = x1.to(torch.float32) 2025-05-07T20:32:19.1030950Z y = x0_fp32 * torch.sigmoid(x0_fp32) * x1_fp32 2025-05-07T20:32:19.1031410Z return triton_quantize_fp8_row(y, scale_ub_tensor) 2025-05-07T20:32:19.1032086Z 2025-05-07T20:32:19.1032347Z > y_fp8_ref, y_scale_ref = ref_fn() 2025-05-07T20:32:19.1032595Z 2025-05-07T20:32:19.1032730Z moe/activation_test.py:126: 2025-05-07T20:32:19.1033112Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:19.1033551Z moe/activation_test.py:124: in ref_fn 2025-05-07T20:32:19.1033996Z return triton_quantize_fp8_row(y, scale_ub_tensor) 2025-05-07T20:32:19.1035096Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/fbgemm_gpu/experimental/gemm/triton_gemm/fp8_gemm.py:2370: in triton_quantize_fp8_row 2025-05-07T20:32:19.1036254Z _kernel_quantize_fp8_row[grid]( 2025-05-07T20:32:19.1037011Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:19.1037935Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:19.1038874Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/autotuner.py:186: in run 2025-05-07T20:32:19.1039858Z timings = {config: self._bench(*args, config=config, **kwargs) for config in pruned_configs} 2025-05-07T20:32:19.1041019Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/autotuner.py:166: in _bench 2025-05-07T20:32:19.1041893Z return self.do_bench(kernel_call, quantiles=(0.5, 0.2, 0.8)) 2025-05-07T20:32:19.1042697Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/testing.py:117: in do_bench 2025-05-07T20:32:19.1043406Z fn() 2025-05-07T20:32:19.1044098Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/autotuner.py:152: in kernel_call 2025-05-07T20:32:19.1044872Z self.fn.run( 2025-05-07T20:32:19.1045502Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:19.1046231Z kernel = self.compile( 2025-05-07T20:32:19.1046969Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:19.1047834Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:19.1048395Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:19.1048721Z 2025-05-07T20:32:19.1049002Z self = 2025-05-07T20:32:19.1050464Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=2, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:19.1052339Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . 
at 0x7fce612f36a0>} 2025-05-07T20:32:19.1054244Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:19.1055580Z context = 2025-05-07T20:32:19.1055962Z 2025-05-07T20:32:19.1056186Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:19.1056853Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:19.1057472Z module_map=module_map) 2025-05-07T20:32:19.1057950Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:19.1058402Z E def _kernel_quantize_fp8_row( 2025-05-07T20:32:19.1058747Z E ^ 2025-05-07T20:32:19.1059359Z E ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:19.1059953Z 2025-05-07T20:32:19.1060647Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:19.1061340Z 2025-05-07T20:32:19.1061483Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:19.1062052Z self=, 2025-05-07T20:32:19.1062593Z T=2048, 2025-05-07T20:32:19.1062855Z D=5120, 2025-05-07T20:32:19.1063114Z scale_ub=1200.0, 2025-05-07T20:32:19.1063427Z contiguous=True, 2025-05-07T20:32:19.1063802Z compiled=False, 2025-05-07T20:32:19.1064085Z ) 2025-05-07T20:32:19.1064528Z self = 2025-05-07T20:32:19.1065214Z T = 2048, D = 5120, scale_ub = 1200.0, contiguous = True, compiled = False 2025-05-07T20:32:19.1065593Z 2025-05-07T20:32:19.1065701Z @given( 2025-05-07T20:32:19.1066020Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:19.1066451Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:19.1066873Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:19.1067331Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:19.1067782Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:19.1068168Z ) 2025-05-07T20:32:19.1068720Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:19.1069306Z def test_silu_mul_quant( 2025-05-07T20:32:19.1069617Z self, 2025-05-07T20:32:19.1069866Z T: int, 2025-05-07T20:32:19.1070142Z D: int, 2025-05-07T20:32:19.1070435Z scale_ub: Optional[float], 2025-05-07T20:32:19.1070786Z contiguous: bool, 2025-05-07T20:32:19.1071110Z compiled: bool, 2025-05-07T20:32:19.1071419Z ) -> None: 2025-05-07T20:32:19.1071707Z torch.manual_seed(2025) 2025-05-07T20:32:19.1072037Z 2025-05-07T20:32:19.1072401Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:19.1072866Z 2025-05-07T20:32:19.1073133Z x_sign = torch.sign(x) 2025-05-07T20:32:19.1073527Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:19.1073939Z x = x_sign * x_clamp 2025-05-07T20:32:19.1074264Z x0 = x[:, :D] 2025-05-07T20:32:19.1074566Z x1 = x[:, D:] 2025-05-07T20:32:19.1074850Z 2025-05-07T20:32:19.1075095Z if contiguous: 2025-05-07T20:32:19.1075389Z x0 = x0.contiguous() 2025-05-07T20:32:19.1075713Z x1 = x1.contiguous() 2025-05-07T20:32:19.1076007Z 2025-05-07T20:32:19.1076248Z if scale_ub is not None: 2025-05-07T20:32:19.1076593Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:19.1077009Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:19.1077405Z ) 2025-05-07T20:32:19.1077653Z else: 2025-05-07T20:32:19.1077926Z scale_ub_tensor = None 2025-05-07T20:32:19.1078269Z 2025-05-07T20:32:19.1078579Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:19.1079010Z op = silu_mul_quant 2025-05-07T20:32:19.1079359Z if compiled: 
2025-05-07T20:32:19.1079698Z op = torch.compile(op) 2025-05-07T20:32:19.1080100Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:19.1080491Z 2025-05-07T20:32:19.1080760Z > y_fp8, y_scale = fn() 2025-05-07T20:32:19.1080983Z 2025-05-07T20:32:19.1081124Z moe/activation_test.py:117: 2025-05-07T20:32:19.1081522Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:19.1081982Z moe/activation_test.py:115: in fn 2025-05-07T20:32:19.1082367Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:19.1083254Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:32:19.1084131Z _fbgemm_silu_mul_quant[grid]( 2025-05-07T20:32:19.1084816Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:19.1085732Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:19.1086563Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:19.1087251Z kernel = self.compile( 2025-05-07T20:32:19.1087932Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:19.1088753Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:19.1089341Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:19.1089638Z 2025-05-07T20:32:19.1089891Z self = 2025-05-07T20:32:19.1091250Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:19.1093002Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7fce60f61f80>} 2025-05-07T20:32:19.1094925Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:19.1096253Z context = 2025-05-07T20:32:19.1096620Z 2025-05-07T20:32:19.1096838Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:19.1097508Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:19.1098105Z module_map=module_map) 2025-05-07T20:32:19.1098865Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:19.1099326Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:32:19.1099663Z E ^ 2025-05-07T20:32:19.1100280Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:19.1100940Z 2025-05-07T20:32:19.1101499Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:19.1102178Z
[Hypothesis retried eight further examples. Each retry re-printed the identical test source listed above and failed with the same CompilationError; only the distinguishing parameters and failure path are kept below.]
Trying example: test_silu_mul_quant(T=2048, D=5120, scale_ub=1200.0, contiguous=True, compiled=True) -> failed in ref_fn() (moe/activation_test.py:126) via triton_quantize_fp8_row -> _kernel_quantize_fp8_row: same CompilationError.
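The ValueError above is the root cause for every remaining retry: the Triton backend on this runner's GPU only offers the 'fp8e4b15' and 'fp8e5' fp8 dtypes, while both _kernel_quantize_fp8_row and _fbgemm_silu_mul_quant request fp8e4nv. A guard along the following lines could skip the fp8 path on such devices. This is a minimal sketch; the (8, 9) compute-capability threshold for fp8e4nv support and the class name are assumptions, not something stated in this log.

import unittest
import torch

def supports_fp8e4nv() -> bool:
    # Assumption: Triton's fp8e4nv (e4m3) type needs compute capability >= (8, 9).
    if not torch.cuda.is_available():
        return False
    return torch.cuda.get_device_capability() >= (8, 9)

@unittest.skipIf(not supports_fp8e4nv(), "Triton fp8e4nv unsupported on this GPU")
class Fp8ActivationTests(unittest.TestCase):  # hypothetical container, not the real ActivationTests
    pass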
Trying example: test_silu_mul_quant(T=16384, D=7168, scale_ub=1200.0, contiguous=False, compiled=False) -> failed in fn() (moe/activation_test.py:117) via silu_mul_quant (moe/activation.py:80) -> _fbgemm_silu_mul_quant: same CompilationError.
Trying example: test_silu_mul_quant(T=1, D=7168, scale_ub=None, contiguous=True, compiled=True) -> failed in ref_fn() (moe/activation_test.py:126) via triton_quantize_fp8_row -> _kernel_quantize_fp8_row: same CompilationError.
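Before the CompilationError took over, the first three sub-failures in the exception group were CUDA OOMs: small 20-40 MiB allocations failed because earlier generated examples had already filled the 22.07 GiB device. The OOM message itself suggests expandable segments; a sketch of that, plus explicit cache release between examples, with an illustrative helper name:

import os

# Must be set before the first CUDA allocation, ideally before importing torch;
# this mirrors the PYTORCH_CUDA_ALLOC_CONF suggestion printed in the OOM message above.
os.environ.setdefault("PYTORCH_CUDA_ALLOC_CONF", "expandable_segments:True")

import torch

def release_cuda_cache() -> None:
    # Hypothetical helper: return cached allocator blocks to the driver between
    # Hypothesis examples so one example's tensors cannot starve the next.
    if torch.cuda.is_available():
        torch.cuda.empty_cache()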
Trying example: test_silu_mul_quant(T=4096, D=5120, scale_ub=None, contiguous=False, compiled=False) -> failed in fn() (moe/activation_test.py:117) via silu_mul_quant (moe/activation.py:80) -> _fbgemm_silu_mul_quant: same CompilationError.
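For reference, ref_fn in the source listing above computes a SiLU-gated product, y = x0 * sigmoid(x0) * x1, and then row-wise fp8 quantization via triton_quantize_fp8_row. Below is a plain-PyTorch sketch of that reference math, assuming a build with torch.float8_e4m3fn; the scale convention (per-row absolute max over the fp8 maximum, optionally clamped by scale_ub) is an assumed reading of triton_quantize_fp8_row's semantics, written out only to make the test's intent concrete.

from typing import Optional, Tuple
import torch

def silu_mul_quant_ref(
    x0: torch.Tensor, x1: torch.Tensor, scale_ub: Optional[torch.Tensor] = None
) -> Tuple[torch.Tensor, torch.Tensor]:
    # SiLU(x0) * x1 in fp32, as in the test's ref_fn.
    y = x0.float() * torch.sigmoid(x0.float()) * x1.float()
    # Assumed row-wise convention: one scale per row such that
    # y ~= y_fp8.float() * scale[:, None], matching how the test dequantizes.
    fp8_max = torch.finfo(torch.float8_e4m3fn).max
    row_max = y.abs().amax(dim=1)
    if scale_ub is not None:
        row_max = torch.minimum(row_max, scale_ub)
    scale = (row_max / fp8_max).clamp(min=1e-12)
    y_fp8 = (y / scale[:, None]).to(torch.float8_e4m3fn)
    return y_fp8, scale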
Trying example: test_silu_mul_quant(T=4096, D=7168, scale_ub=None, contiguous=False, compiled=False) -> failed in fn() (moe/activation_test.py:117) via silu_mul_quant (moe/activation.py:80) -> _fbgemm_silu_mul_quant: same CompilationError.
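Note that the ref_fn-side tracebacks all pass through Triton's autotuner (autotuner.py:186 run -> :166 _bench -> testing.py:117 do_bench) before reaching the compiler, which is why a dtype-support problem surfaces from inside a benchmarking loop: each candidate config is compiled lazily the first time it is timed. A simplified model of that flow, not Triton's actual implementation:

import time
from typing import Callable, Dict, Sequence, Tuple

def autotune_pick_fastest(
    kernel: Callable[..., None], configs: Sequence[Dict[str, int]], *args: object
) -> Dict[str, int]:
    # Bench every candidate config and keep the fastest, as Autotuner.run does;
    # the first call per config JIT-compiles, so compile errors surface mid-benchmark.
    timings: Dict[Tuple[Tuple[str, int], ...], float] = {}
    for config in configs:
        start = time.perf_counter()
        kernel(*args, **config)
        timings[tuple(sorted(config.items()))] = time.perf_counter() - start
    return dict(min(timings, key=timings.get))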
Trying example: test_silu_mul_quant(T=128, D=7168, scale_ub=None, contiguous=False, compiled=True) -> failed in ref_fn() (moe/activation_test.py:126) via triton_quantize_fp8_row -> _kernel_quantize_fp8_row: same CompilationError.
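Each falsifying example in the exception group above also came with a replay recipe: Hypothesis serializes the choices behind the example into a blob, and temporarily decorating the test with that blob deterministically replays exactly that example. The blob below is the one printed for sub-failure 4; it only decodes against test_silu_mul_quant's existing @given strategies, so it is shown as a patch sketch, and the decorator should be removed once the bug is fixed.

from hypothesis import reproduce_failure

# Copied verbatim from the log; place directly above the existing @given block
# in moe/activation_test.py:
#
#     @reproduce_failure('6.131.14', b'AEEAQQBBAEEAQQA=')
#     @given(
#         T=st.sampled_from([1, 128, 2048, 4096, 16384]),
#         ...
#     )
#     def test_silu_mul_quant(self, T, D, scale_ub, contiguous, compiled): ...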
2025-05-07T20:32:19.1254266Z 
2025-05-07T20:32:19.1254377Z Trying example: test_silu_mul_quant(
2025-05-07T20:32:19.1254595Z     self=<...>,
2025-05-07T20:32:19.1254681Z     T=128,
2025-05-07T20:32:19.1254757Z     D=7168,
2025-05-07T20:32:19.1254938Z     scale_ub=None,
2025-05-07T20:32:19.1255028Z     contiguous=False,
2025-05-07T20:32:19.1255110Z     compiled=False,
2025-05-07T20:32:19.1255179Z )
2025-05-07T20:32:19.1255399Z self = <...>
2025-05-07T20:32:19.1255568Z T = 128, D = 7168, scale_ub = None, contiguous = False, compiled = False
2025-05-07T20:32:19.1255572Z 
2025-05-07T20:32:19.1255647Z     @given(
2025-05-07T20:32:19.1255769Z         T=st.sampled_from([1, 128, 2048, 4096, 16384]),
2025-05-07T20:32:19.1255866Z         D=st.sampled_from([5120, 7168]),
2025-05-07T20:32:19.1255984Z         scale_ub=st.sampled_from([None, 1200.00]),
2025-05-07T20:32:19.1256101Z         contiguous=st.sampled_from([True, False]),
2025-05-07T20:32:19.1256212Z         compiled=st.sampled_from([True, False]),
2025-05-07T20:32:19.1256293Z     )
2025-05-07T20:32:19.1256533Z     @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None)
2025-05-07T20:32:19.1256631Z     def test_silu_mul_quant(
2025-05-07T20:32:19.1256711Z         self,
2025-05-07T20:32:19.1256788Z         T: int,
2025-05-07T20:32:19.1256864Z         D: int,
2025-05-07T20:32:19.1256966Z         scale_ub: Optional[float],
2025-05-07T20:32:19.1257053Z         contiguous: bool,
2025-05-07T20:32:19.1257135Z         compiled: bool,
2025-05-07T20:32:19.1257216Z     ) -> None:
2025-05-07T20:32:19.1257308Z         torch.manual_seed(2025)
2025-05-07T20:32:19.1257386Z 
2025-05-07T20:32:19.1257551Z         x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16)
2025-05-07T20:32:19.1257625Z 
2025-05-07T20:32:19.1257722Z         x_sign = torch.sign(x)
2025-05-07T20:32:19.1257844Z         x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0)
2025-05-07T20:32:19.1257935Z         x = x_sign * x_clamp
2025-05-07T20:32:19.1258020Z         x0 = x[:, :D]
2025-05-07T20:32:19.1258098Z         x1 = x[:, D:]
2025-05-07T20:32:19.1258186Z 
2025-05-07T20:32:19.1258265Z         if contiguous:
2025-05-07T20:32:19.1258365Z             x0 = x0.contiguous()
2025-05-07T20:32:19.1258451Z             x1 = x1.contiguous()
2025-05-07T20:32:19.1258524Z 
2025-05-07T20:32:19.1258619Z         if scale_ub is not None:
2025-05-07T20:32:19.1258719Z             scale_ub_tensor = torch.tensor(
2025-05-07T20:32:19.1258855Z                 [scale_ub], device="cuda", dtype=torch.float32
2025-05-07T20:32:19.1258940Z             )
2025-05-07T20:32:19.1259016Z         else:
2025-05-07T20:32:19.1259107Z             scale_ub_tensor = None
2025-05-07T20:32:19.1259189Z 
2025-05-07T20:32:19.1259317Z         def fn() -> Tuple[torch.Tensor, torch.Tensor]:
2025-05-07T20:32:19.1259410Z             op = silu_mul_quant
2025-05-07T20:32:19.1259545Z             if compiled:
2025-05-07T20:32:19.1259642Z                 op = torch.compile(op)
2025-05-07T20:32:19.1259755Z             return op(x0, x1, scale_ub_tensor)
2025-05-07T20:32:19.1259826Z 
2025-05-07T20:32:19.1259915Z >       y_fp8, y_scale = fn()
2025-05-07T20:32:19.1259920Z 
2025-05-07T20:32:19.1260027Z moe/activation_test.py:117: 
2025-05-07T20:32:19.1260153Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 
2025-05-07T20:32:19.1260252Z moe/activation_test.py:115: in fn
2025-05-07T20:32:19.1260399Z     return op(x0, x1, scale_ub_tensor)
2025-05-07T20:32:19.1260884Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant
2025-05-07T20:32:19.1260983Z     _fbgemm_silu_mul_quant[grid](
2025-05-07T20:32:19.1261335Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/jit.py:330: in <lambda>
2025-05-07T20:32:19.1261551Z     return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs)
2025-05-07T20:32:19.1261894Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/jit.py:623: in run
2025-05-07T20:32:19.1261985Z     kernel = self.compile(
2025-05-07T20:32:19.1262462Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:273: in compile
2025-05-07T20:32:19.1262636Z     module = src.make_ir(options, codegen_fns, module_map, context)
2025-05-07T20:32:19.1262762Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 
2025-05-07T20:32:19.1262770Z 
2025-05-07T20:32:19.1262978Z self = <triton.compiler.compiler.ASTSource object at 0x...>
2025-05-07T20:32:19.1263734Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True)
2025-05-07T20:32:19.1264241Z codegen_fns = {'convert_custom_types': <function ...>, 'min_dot_size': <function ... at 0x7fce5b278680>}
2025-05-07T20:32:19.1264974Z module_map = {'triton.language.extra.libdevice': <module ...>}
2025-05-07T20:32:19.1265161Z context = <...>
2025-05-07T20:32:19.1265169Z 
2025-05-07T20:32:19.1265334Z     def make_ir(self, options, codegen_fns, module_map, context):
2025-05-07T20:32:19.1265589Z >       return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns,
2025-05-07T20:32:19.1265697Z                            module_map=module_map)
2025-05-07T20:32:19.1265852Z E       triton.compiler.errors.CompilationError: at 1:0:
2025-05-07T20:32:19.1265945Z E       def _fbgemm_silu_mul_quant(
2025-05-07T20:32:19.1266025Z E       ^
2025-05-07T20:32:19.1266369Z E       ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')")
2025-05-07T20:32:19.1266374Z 
2025-05-07T20:32:19.1266783Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:100: CompilationError
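The reference path failing here, triton_quantize_fp8_row, implements rowwise FP8 quantization: each row is divided by a per-row scale derived from its max-abs value (optionally capped by scale_ub), and that scale is returned so that y_fp8.to(torch.float32) * y_scale[:, None] reconstructs y, which is exactly how the test dequantizes. A minimal eager-mode sketch of that contract, assuming e4m3 as the target dtype; the eps floor and the way scale_ub caps the row maximum are assumptions, not FBGEMM's exact implementation:

    from typing import Optional, Tuple

    import torch

    FP8_MAX = torch.finfo(torch.float8_e4m3fn).max  # 448.0 for e4m3fn

    def quantize_fp8_row_ref(
        x: torch.Tensor, scale_ub: Optional[torch.Tensor] = None
    ) -> Tuple[torch.Tensor, torch.Tensor]:
        # Per-row max-abs with a tiny floor so all-zero rows do not divide by zero.
        row_max = x.abs().amax(dim=-1).clamp(min=1e-12)
        if scale_ub is not None:
            row_max = torch.minimum(row_max, scale_ub)  # assumed cap semantics
        y_scale = row_max / FP8_MAX  # per-row dequantization scale
        y_fp8 = (x / y_scale[:, None]).to(torch.float8_e4m3fn)
        return y_fp8, y_scale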
2025-05-07T20:32:19.1266794Z 
Hypothesis then tried nine more examples. Each re-ran the identical test body shown above, and every one failed with the same triton.compiler.errors.CompilationError: ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')"). Only the drawn parameters and the first call to reach the Triton JIT differ (the fused kernel compiles with num_stages=3; the rowwise quantize kernel is reached through the autotuner and compiles with num_stages=2); the per-example source listings and tracebacks are verbatim repeats of the one above:

Trying example: test_silu_mul_quant(T=4096, D=5120, scale_ub=1200.0, contiguous=True, compiled=False) -> failed at y_fp8, y_scale = fn() (moe/activation_test.py:117), compiling _fbgemm_silu_mul_quant
Trying example: test_silu_mul_quant(T=1, D=5120, scale_ub=None, contiguous=True, compiled=True) -> failed at y_fp8_ref, y_scale_ref = ref_fn() (moe/activation_test.py:126), compiling _kernel_quantize_fp8_row
Trying example: test_silu_mul_quant(T=2048, D=5120, scale_ub=None, contiguous=True, compiled=True) -> failed at ref_fn() (moe/activation_test.py:126), compiling _kernel_quantize_fp8_row
Trying example: test_silu_mul_quant(T=128, D=5120, scale_ub=None, contiguous=True, compiled=True) -> failed at ref_fn() (moe/activation_test.py:126), compiling _kernel_quantize_fp8_row
Trying example: test_silu_mul_quant(T=4096, D=5120, scale_ub=None, contiguous=True, compiled=True) -> failed at ref_fn() (moe/activation_test.py:126), compiling _kernel_quantize_fp8_row
Trying example: test_silu_mul_quant(T=16384, D=5120, scale_ub=None, contiguous=True, compiled=True) -> failed at ref_fn() (moe/activation_test.py:126), compiling _kernel_quantize_fp8_row
Trying example: test_silu_mul_quant(T=1, D=5120, scale_ub=1200.0, contiguous=True, compiled=True) -> failed at y_fp8, y_scale = fn() (moe/activation_test.py:117), entering through torch/_dynamo/eval_frame.py:678, compiling _fbgemm_silu_mul_quant
Trying example: test_silu_mul_quant(T=1, D=5120, scale_ub=None, contiguous=False, compiled=True) -> failed at ref_fn() (moe/activation_test.py:126), compiling _kernel_quantize_fp8_row
Trying example: test_silu_mul_quant(T=1, D=5120, scale_ub=None, contiguous=True, compiled=False) -> failed at y_fp8, y_scale = fn() (moe/activation_test.py:117), compiling _fbgemm_silu_mul_quant
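Note that T, D, contiguity, and torch.compile make no difference above: each Hypothesis example re-enters the Triton JIT, and compilation fails before any kernel runs, in the fused op for some draws (silu_mul_quant fuses SiLU(x0) * x1 with the rowwise quantization) and in the reference quantizer for others. An unfused eager-mode equivalent of what the test computes, reusing quantize_fp8_row_ref from the sketch above; names here are illustrative, not the library's API:

    from typing import Optional, Tuple

    import torch

    def silu_mul_quant_ref(
        x0: torch.Tensor,
        x1: torch.Tensor,
        scale_ub: Optional[torch.Tensor] = None,
    ) -> Tuple[torch.Tensor, torch.Tensor]:
        # SiLU(x0) * x1 in fp32, then rowwise FP8 quantization, mirroring ref_fn.
        x0_fp32 = x0.to(torch.float32)
        y = x0_fp32 * torch.sigmoid(x0_fp32) * x1.to(torch.float32)
        return quantize_fp8_row_ref(y, scale_ub)  # sketch defined earlier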
_fbgemm_silu_mul_quant[grid]( 2025-05-07T20:32:19.1404508Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:19.1404740Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:19.1405076Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:19.1405177Z kernel = self.compile( 2025-05-07T20:32:19.1405559Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:19.1405731Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:19.1405869Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:19.1405873Z 2025-05-07T20:32:19.1406079Z self = 2025-05-07T20:32:19.1406928Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:19.1407433Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7fce60114400>} 2025-05-07T20:32:19.1408164Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:19.1408403Z context = 2025-05-07T20:32:19.1408408Z 2025-05-07T20:32:19.1408571Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:19.1408835Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:19.1408947Z module_map=module_map) 2025-05-07T20:32:19.1409110Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:19.1409216Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:32:19.1409294Z E ^ 2025-05-07T20:32:19.1409732Z E ValueError("type fp8e4nv not supported in this architecture. 
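The op under test fuses exactly these two steps. Mirroring the test's ref_fn, an eager equivalent of silu_mul_quant would look like the following sketch (names are illustrative; quantize_fp8_row_sketch is the rowwise helper sketched earlier, standing in for fbgemm_gpu's triton_quantize_fp8_row):

    import torch

    def silu_mul_quant_sketch(x0, x1, scale_ub):
        """Eager equivalent of the fused kernel: SiLU(x0) * x1, then rowwise FP8."""
        x0_fp32 = x0.to(torch.float32)
        x1_fp32 = x1.to(torch.float32)
        y = x0_fp32 * torch.sigmoid(x0_fp32) * x1_fp32  # SiLU(x0) * x1
        # Rowwise quantization as in the earlier sketch (assumption: same
        # semantics as triton_quantize_fp8_row).
        return quantize_fp8_row_sketch(y, scale_ub)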
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:19.1409737Z 2025-05-07T20:32:19.1410143Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:19.1410150Z 2025-05-07T20:32:19.1410255Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:19.1410481Z self=, 2025-05-07T20:32:19.1410561Z T=128, 2025-05-07T20:32:19.1410639Z D=5120, 2025-05-07T20:32:19.1410730Z scale_ub=None, 2025-05-07T20:32:19.1410818Z contiguous=False, 2025-05-07T20:32:19.1410912Z compiled=True, 2025-05-07T20:32:19.1410987Z ) 2025-05-07T20:32:19.1411203Z self = 2025-05-07T20:32:19.1411376Z T = 128, D = 5120, scale_ub = None, contiguous = False, compiled = True 2025-05-07T20:32:19.1411380Z 2025-05-07T20:32:19.1411464Z @given( 2025-05-07T20:32:19.1411584Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:19.1411693Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:19.1411809Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:19.1411930Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:19.1412050Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:19.1412125Z ) 2025-05-07T20:32:19.1412372Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:19.1412468Z def test_silu_mul_quant( 2025-05-07T20:32:19.1412544Z self, 2025-05-07T20:32:19.1412630Z T: int, 2025-05-07T20:32:19.1412707Z D: int, 2025-05-07T20:32:19.1412804Z scale_ub: Optional[float], 2025-05-07T20:32:19.1412900Z contiguous: bool, 2025-05-07T20:32:19.1412987Z compiled: bool, 2025-05-07T20:32:19.1413067Z ) -> None: 2025-05-07T20:32:19.1413178Z torch.manual_seed(2025) 2025-05-07T20:32:19.1413251Z 2025-05-07T20:32:19.1413418Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:19.1413497Z 2025-05-07T20:32:19.1413589Z x_sign = torch.sign(x) 2025-05-07T20:32:19.1413811Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:19.1413902Z x = x_sign * x_clamp 2025-05-07T20:32:19.1413984Z x0 = x[:, :D] 2025-05-07T20:32:19.1414072Z x1 = x[:, D:] 2025-05-07T20:32:19.1414145Z 2025-05-07T20:32:19.1414229Z if contiguous: 2025-05-07T20:32:19.1414326Z x0 = x0.contiguous() 2025-05-07T20:32:19.1414416Z x1 = x1.contiguous() 2025-05-07T20:32:19.1414539Z 2025-05-07T20:32:19.1414633Z if scale_ub is not None: 2025-05-07T20:32:19.1414738Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:19.1414872Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:19.1414953Z ) 2025-05-07T20:32:19.1415034Z else: 2025-05-07T20:32:19.1415128Z scale_ub_tensor = None 2025-05-07T20:32:19.1415206Z 2025-05-07T20:32:19.1415335Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:19.1415496Z op = silu_mul_quant 2025-05-07T20:32:19.1415581Z if compiled: 2025-05-07T20:32:19.1415680Z op = torch.compile(op) 2025-05-07T20:32:19.1415790Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:19.1415863Z 2025-05-07T20:32:19.1415953Z > y_fp8, y_scale = fn() 2025-05-07T20:32:19.1415958Z 2025-05-07T20:32:19.1416062Z moe/activation_test.py:117: 2025-05-07T20:32:19.1416189Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:19.1416294Z moe/activation_test.py:115: in fn 2025-05-07T20:32:19.1416399Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:19.1416835Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/torch/_dynamo/eval_frame.py:678: in _fn 2025-05-07T20:32:19.1416936Z return fn(*args, **kwargs) 
2025-05-07T20:32:19.1417423Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:32:19.1417524Z _fbgemm_silu_mul_quant[grid]( 2025-05-07T20:32:19.1417885Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:19.1418104Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:19.1418447Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:19.1418543Z kernel = self.compile( 2025-05-07T20:32:19.1418920Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:19.1419097Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:19.1419231Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:19.1419236Z 2025-05-07T20:32:19.1419440Z self = 2025-05-07T20:32:19.1420213Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:19.1420713Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7fcda5d91ee0>} 2025-05-07T20:32:19.1421453Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:19.1421647Z context = 2025-05-07T20:32:19.1421652Z 2025-05-07T20:32:19.1421821Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:19.1422079Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:19.1422190Z module_map=module_map) 2025-05-07T20:32:19.1422357Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:19.1422457Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:32:19.1422537Z E ^ 2025-05-07T20:32:19.1422893Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:19.1422975Z 2025-05-07T20:32:19.1423378Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:19.1423383Z 2025-05-07T20:32:19.1423493Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:19.1423719Z self=, 2025-05-07T20:32:19.1423798Z T=128, 2025-05-07T20:32:19.1423884Z D=7168, 2025-05-07T20:32:19.1423967Z scale_ub=1200.0, 2025-05-07T20:32:19.1424055Z contiguous=False, 2025-05-07T20:32:19.1424188Z compiled=False, 2025-05-07T20:32:19.1424262Z ) 2025-05-07T20:32:19.1424491Z self = 2025-05-07T20:32:19.1424662Z T = 128, D = 7168, scale_ub = 1200.0, contiguous = False, compiled = False 2025-05-07T20:32:19.1424667Z 2025-05-07T20:32:19.1424743Z @given( 2025-05-07T20:32:19.1424867Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:19.1424971Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:19.1425088Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:19.1425212Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:19.1425324Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:19.1425400Z ) 2025-05-07T20:32:19.1425725Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:19.1425820Z def test_silu_mul_quant( 2025-05-07T20:32:19.1425903Z self, 2025-05-07T20:32:19.1425990Z T: int, 2025-05-07T20:32:19.1426068Z D: int, 2025-05-07T20:32:19.1426176Z scale_ub: Optional[float], 2025-05-07T20:32:19.1426266Z contiguous: bool, 2025-05-07T20:32:19.1426352Z compiled: bool, 2025-05-07T20:32:19.1426437Z ) -> None: 2025-05-07T20:32:19.1426532Z torch.manual_seed(2025) 2025-05-07T20:32:19.1426605Z 2025-05-07T20:32:19.1426775Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:19.1426853Z 2025-05-07T20:32:19.1426949Z x_sign = torch.sign(x) 2025-05-07T20:32:19.1427078Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:19.1427168Z x = x_sign * x_clamp 2025-05-07T20:32:19.1427255Z x0 = x[:, :D] 2025-05-07T20:32:19.1427340Z x1 = x[:, D:] 2025-05-07T20:32:19.1427414Z 2025-05-07T20:32:19.1427503Z if contiguous: 2025-05-07T20:32:19.1427593Z x0 = x0.contiguous() 2025-05-07T20:32:19.1427684Z x1 = x1.contiguous() 2025-05-07T20:32:19.1427765Z 2025-05-07T20:32:19.1427855Z if scale_ub is not None: 2025-05-07T20:32:19.1427963Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:19.1428102Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:19.1428178Z ) 2025-05-07T20:32:19.1428255Z else: 2025-05-07T20:32:19.1428353Z scale_ub_tensor = None 2025-05-07T20:32:19.1428426Z 2025-05-07T20:32:19.1428559Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:19.1428656Z op = silu_mul_quant 2025-05-07T20:32:19.1428742Z if compiled: 2025-05-07T20:32:19.1428848Z op = torch.compile(op) 2025-05-07T20:32:19.1428956Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:19.1429034Z 2025-05-07T20:32:19.1429131Z > y_fp8, y_scale = fn() 2025-05-07T20:32:19.1429136Z 2025-05-07T20:32:19.1429233Z moe/activation_test.py:117: 2025-05-07T20:32:19.1429361Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:19.1429471Z moe/activation_test.py:115: in fn 2025-05-07T20:32:19.1429569Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:19.1430067Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:32:19.1430164Z 
_fbgemm_silu_mul_quant[grid]( 2025-05-07T20:32:19.1430519Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:19.1430794Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:19.1431134Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:19.1431228Z kernel = self.compile( 2025-05-07T20:32:19.1431614Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:19.1431831Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:19.1431963Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:19.1431967Z 2025-05-07T20:32:19.1432170Z self = 2025-05-07T20:32:19.1432933Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:19.1433560Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7fce5b767560>} 2025-05-07T20:32:19.1434293Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:19.1434491Z context = 2025-05-07T20:32:19.1434496Z 2025-05-07T20:32:19.1434660Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:19.1434920Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:19.1435037Z module_map=module_map) 2025-05-07T20:32:19.1435199Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:19.1435306Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:32:19.1435388Z E ^ 2025-05-07T20:32:19.1435741Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:19.1435746Z 2025-05-07T20:32:19.1436165Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:19.1436169Z 2025-05-07T20:32:19.1436277Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:19.1436506Z self=, 2025-05-07T20:32:19.1436582Z T=128, 2025-05-07T20:32:19.1436660Z D=5120, 2025-05-07T20:32:19.1436751Z scale_ub=None, 2025-05-07T20:32:19.1436839Z contiguous=False, 2025-05-07T20:32:19.1436926Z compiled=False, 2025-05-07T20:32:19.1437008Z ) 2025-05-07T20:32:19.1437228Z self = 2025-05-07T20:32:19.1437400Z T = 128, D = 5120, scale_ub = None, contiguous = False, compiled = False 2025-05-07T20:32:19.1437405Z 2025-05-07T20:32:19.1437493Z @given( 2025-05-07T20:32:19.1437612Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:19.1437726Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:19.1437842Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:19.1437959Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:19.1438082Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:19.1438159Z ) 2025-05-07T20:32:19.1438400Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:19.1438502Z def test_silu_mul_quant( 2025-05-07T20:32:19.1438580Z self, 2025-05-07T20:32:19.1438658Z T: int, 2025-05-07T20:32:19.1438739Z D: int, 2025-05-07T20:32:19.1438838Z scale_ub: Optional[float], 2025-05-07T20:32:19.1438978Z contiguous: bool, 2025-05-07T20:32:19.1439071Z compiled: bool, 2025-05-07T20:32:19.1439151Z ) -> None: 2025-05-07T20:32:19.1439251Z torch.manual_seed(2025) 2025-05-07T20:32:19.1439324Z 2025-05-07T20:32:19.1439496Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:19.1439576Z 2025-05-07T20:32:19.1439668Z x_sign = torch.sign(x) 2025-05-07T20:32:19.1439793Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:19.1439933Z x = x_sign * x_clamp 2025-05-07T20:32:19.1440014Z x0 = x[:, :D] 2025-05-07T20:32:19.1440094Z x1 = x[:, D:] 2025-05-07T20:32:19.1440173Z 2025-05-07T20:32:19.1440256Z if contiguous: 2025-05-07T20:32:19.1440346Z x0 = x0.contiguous() 2025-05-07T20:32:19.1440443Z x1 = x1.contiguous() 2025-05-07T20:32:19.1440515Z 2025-05-07T20:32:19.1440613Z if scale_ub is not None: 2025-05-07T20:32:19.1440721Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:19.1440853Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:19.1440935Z ) 2025-05-07T20:32:19.1441012Z else: 2025-05-07T20:32:19.1441106Z scale_ub_tensor = None 2025-05-07T20:32:19.1441189Z 2025-05-07T20:32:19.1441392Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:19.1441484Z op = silu_mul_quant 2025-05-07T20:32:19.1441580Z if compiled: 2025-05-07T20:32:19.1441681Z op = torch.compile(op) 2025-05-07T20:32:19.1441789Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:19.1441868Z 2025-05-07T20:32:19.1441960Z > y_fp8, y_scale = fn() 2025-05-07T20:32:19.1441965Z 2025-05-07T20:32:19.1442068Z moe/activation_test.py:117: 2025-05-07T20:32:19.1442196Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:19.1442296Z moe/activation_test.py:115: in fn 2025-05-07T20:32:19.1442408Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:19.1442898Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:32:19.1442996Z 
_fbgemm_silu_mul_quant[grid]( 2025-05-07T20:32:19.1443363Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:19.1443578Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:19.1443924Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:19.1444017Z kernel = self.compile( 2025-05-07T20:32:19.1444397Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:19.1444573Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:19.1444699Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:19.1444705Z 2025-05-07T20:32:19.1444913Z self = 2025-05-07T20:32:19.1445679Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:19.1446177Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7fce5a892a20>} 2025-05-07T20:32:19.1446918Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:19.1447109Z context = 2025-05-07T20:32:19.1447184Z 2025-05-07T20:32:19.1447357Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:19.1447616Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:19.1447729Z module_map=module_map) 2025-05-07T20:32:19.1447897Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:19.1447996Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:32:19.1448074Z E ^ 2025-05-07T20:32:19.1448471Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:19.1448476Z 2025-05-07T20:32:19.1448881Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:19.1448885Z 2025-05-07T20:32:19.1448997Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:19.1449216Z self=, 2025-05-07T20:32:19.1449297Z T=128, 2025-05-07T20:32:19.1449380Z D=5120, 2025-05-07T20:32:19.1449465Z scale_ub=1200.0, 2025-05-07T20:32:19.1449557Z contiguous=True, 2025-05-07T20:32:19.1449655Z compiled=False, 2025-05-07T20:32:19.1449730Z ) 2025-05-07T20:32:19.1450038Z self = 2025-05-07T20:32:19.1450209Z T = 128, D = 5120, scale_ub = 1200.0, contiguous = True, compiled = False 2025-05-07T20:32:19.1450214Z 2025-05-07T20:32:19.1450294Z @given( 2025-05-07T20:32:19.1450424Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:19.1450527Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:19.1450640Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:19.1450761Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:19.1456942Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:19.1457049Z ) 2025-05-07T20:32:19.1457310Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:19.1457417Z def test_silu_mul_quant( 2025-05-07T20:32:19.1457497Z self, 2025-05-07T20:32:19.1457575Z T: int, 2025-05-07T20:32:19.1457662Z D: int, 2025-05-07T20:32:19.1457769Z scale_ub: Optional[float], 2025-05-07T20:32:19.1457864Z contiguous: bool, 2025-05-07T20:32:19.1457959Z compiled: bool, 2025-05-07T20:32:19.1458040Z ) -> None: 2025-05-07T20:32:19.1458138Z torch.manual_seed(2025) 2025-05-07T20:32:19.1458226Z 2025-05-07T20:32:19.1458398Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:19.1458474Z 2025-05-07T20:32:19.1458587Z x_sign = torch.sign(x) 2025-05-07T20:32:19.1458713Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:19.1458812Z x = x_sign * x_clamp 2025-05-07T20:32:19.1458898Z x0 = x[:, :D] 2025-05-07T20:32:19.1458983Z x1 = x[:, D:] 2025-05-07T20:32:19.1459068Z 2025-05-07T20:32:19.1459154Z if contiguous: 2025-05-07T20:32:19.1459250Z x0 = x0.contiguous() 2025-05-07T20:32:19.1459351Z x1 = x1.contiguous() 2025-05-07T20:32:19.1459427Z 2025-05-07T20:32:19.1459521Z if scale_ub is not None: 2025-05-07T20:32:19.1459645Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:19.1459784Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:19.1459863Z ) 2025-05-07T20:32:19.1459951Z else: 2025-05-07T20:32:19.1460050Z scale_ub_tensor = None 2025-05-07T20:32:19.1460126Z 2025-05-07T20:32:19.1460269Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:19.1460362Z op = silu_mul_quant 2025-05-07T20:32:19.1460457Z if compiled: 2025-05-07T20:32:19.1460559Z op = torch.compile(op) 2025-05-07T20:32:19.1460667Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:19.1460824Z 2025-05-07T20:32:19.1460919Z > y_fp8, y_scale = fn() 2025-05-07T20:32:19.1460924Z 2025-05-07T20:32:19.1461025Z moe/activation_test.py:117: 2025-05-07T20:32:19.1461167Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:19.1461275Z moe/activation_test.py:115: in fn 2025-05-07T20:32:19.1461378Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:19.1461883Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:32:19.1462033Z 
_fbgemm_silu_mul_quant[grid]( 2025-05-07T20:32:19.1462398Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:19.1462622Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:19.1462962Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:19.1463069Z kernel = self.compile( 2025-05-07T20:32:19.1463451Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:19.1463633Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:19.1463845Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:19.1463850Z 2025-05-07T20:32:19.1464056Z self = 2025-05-07T20:32:19.1464838Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:19.1465338Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7fce5a892c00>} 2025-05-07T20:32:19.1466088Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:19.1466288Z context = 2025-05-07T20:32:19.1466292Z 2025-05-07T20:32:19.1466458Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:19.1466726Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:19.1466841Z module_map=module_map) 2025-05-07T20:32:19.1467013Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:19.1467116Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:32:19.1467195Z E ^ 2025-05-07T20:32:19.1467557Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:19.1467564Z 2025-05-07T20:32:19.1467975Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:19.1467980Z 2025-05-07T20:32:19.1468097Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:19.1468324Z self=, 2025-05-07T20:32:19.1468404Z T=1, 2025-05-07T20:32:19.1468495Z D=7168, 2025-05-07T20:32:19.1468582Z scale_ub=1200.0, 2025-05-07T20:32:19.1468672Z contiguous=True, 2025-05-07T20:32:19.1468769Z compiled=True, 2025-05-07T20:32:19.1468848Z ) 2025-05-07T20:32:19.1469069Z self = 2025-05-07T20:32:19.1469247Z T = 1, D = 7168, scale_ub = 1200.0, contiguous = True, compiled = True 2025-05-07T20:32:19.1469252Z 2025-05-07T20:32:19.1469332Z @given( 2025-05-07T20:32:19.1469464Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:19.1469617Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:19.1469737Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:19.1469861Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:19.1469973Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:19.1470060Z ) 2025-05-07T20:32:19.1470311Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:19.1470407Z def test_silu_mul_quant( 2025-05-07T20:32:19.1470490Z self, 2025-05-07T20:32:19.1470613Z T: int, 2025-05-07T20:32:19.1470695Z D: int, 2025-05-07T20:32:19.1470802Z scale_ub: Optional[float], 2025-05-07T20:32:19.1470893Z contiguous: bool, 2025-05-07T20:32:19.1470981Z compiled: bool, 2025-05-07T20:32:19.1471069Z ) -> None: 2025-05-07T20:32:19.1471165Z torch.manual_seed(2025) 2025-05-07T20:32:19.1471243Z 2025-05-07T20:32:19.1471419Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:19.1471500Z 2025-05-07T20:32:19.1471596Z x_sign = torch.sign(x) 2025-05-07T20:32:19.1471732Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:19.1471822Z x = x_sign * x_clamp 2025-05-07T20:32:19.1471910Z x0 = x[:, :D] 2025-05-07T20:32:19.1472070Z x1 = x[:, D:] 2025-05-07T20:32:19.1472146Z 2025-05-07T20:32:19.1472238Z if contiguous: 2025-05-07T20:32:19.1472332Z x0 = x0.contiguous() 2025-05-07T20:32:19.1472424Z x1 = x1.contiguous() 2025-05-07T20:32:19.1472510Z 2025-05-07T20:32:19.1472601Z if scale_ub is not None: 2025-05-07T20:32:19.1472710Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:19.1472855Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:19.1472933Z ) 2025-05-07T20:32:19.1473010Z else: 2025-05-07T20:32:19.1473113Z scale_ub_tensor = None 2025-05-07T20:32:19.1473187Z 2025-05-07T20:32:19.1473326Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:19.1473419Z op = silu_mul_quant 2025-05-07T20:32:19.1473505Z if compiled: 2025-05-07T20:32:19.1473610Z op = torch.compile(op) 2025-05-07T20:32:19.1473720Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:19.1473794Z 2025-05-07T20:32:19.1473891Z > y_fp8, y_scale = fn() 2025-05-07T20:32:19.1473896Z 2025-05-07T20:32:19.1473996Z moe/activation_test.py:117: 2025-05-07T20:32:19.1474125Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:19.1474235Z moe/activation_test.py:115: in fn 2025-05-07T20:32:19.1474337Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:19.1474707Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/torch/_dynamo/eval_frame.py:678: in _fn 2025-05-07T20:32:19.1474801Z return fn(*args, **kwargs) 
2025-05-07T20:32:19.1475291Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:32:19.1475398Z _fbgemm_silu_mul_quant[grid]( 2025-05-07T20:32:19.1475753Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:19.1475982Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:19.1476325Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:19.1476424Z kernel = self.compile( 2025-05-07T20:32:19.1476810Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:19.1476984Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:19.1477113Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:19.1477118Z 2025-05-07T20:32:19.1477329Z self = 2025-05-07T20:32:19.1478147Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:19.1478655Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7fce5a215080>} 2025-05-07T20:32:19.1479431Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:19.1479623Z context = 2025-05-07T20:32:19.1479634Z 2025-05-07T20:32:19.1479800Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:19.1480062Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:19.1480178Z module_map=module_map) 2025-05-07T20:32:19.1480343Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:19.1480541Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:32:19.1480627Z E ^ 2025-05-07T20:32:19.1480979Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:19.1480990Z 2025-05-07T20:32:19.1481403Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:19.1481408Z 2025-05-07T20:32:19.1481512Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:19.1481734Z self=, 2025-05-07T20:32:19.1481822Z T=1, 2025-05-07T20:32:19.1481899Z D=7168, 2025-05-07T20:32:19.1481988Z scale_ub=1200.0, 2025-05-07T20:32:19.1482081Z contiguous=False, 2025-05-07T20:32:19.1482165Z compiled=True, 2025-05-07T20:32:19.1482242Z ) 2025-05-07T20:32:19.1482464Z self = 2025-05-07T20:32:19.1482633Z T = 1, D = 7168, scale_ub = 1200.0, contiguous = False, compiled = True 2025-05-07T20:32:19.1482638Z 2025-05-07T20:32:19.1482723Z @given( 2025-05-07T20:32:19.1482846Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:19.1482948Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:19.1483074Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:19.1483192Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:19.1483307Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:19.1483390Z ) 2025-05-07T20:32:19.1483634Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:19.1483736Z def test_silu_mul_quant( 2025-05-07T20:32:19.1483817Z self, 2025-05-07T20:32:19.1483895Z T: int, 2025-05-07T20:32:19.1483977Z D: int, 2025-05-07T20:32:19.1484077Z scale_ub: Optional[float], 2025-05-07T20:32:19.1484168Z contiguous: bool, 2025-05-07T20:32:19.1484262Z compiled: bool, 2025-05-07T20:32:19.1484346Z ) -> None: 2025-05-07T20:32:19.1484442Z torch.manual_seed(2025) 2025-05-07T20:32:19.1484524Z 2025-05-07T20:32:19.1484692Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:19.1484771Z 2025-05-07T20:32:19.1484872Z x_sign = torch.sign(x) 2025-05-07T20:32:19.1485001Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:19.1485092Z x = x_sign * x_clamp 2025-05-07T20:32:19.1485182Z x0 = x[:, :D] 2025-05-07T20:32:19.1485264Z x1 = x[:, D:] 2025-05-07T20:32:19.1485345Z 2025-05-07T20:32:19.1485431Z if contiguous: 2025-05-07T20:32:19.1485523Z x0 = x0.contiguous() 2025-05-07T20:32:19.1485668Z x1 = x1.contiguous() 2025-05-07T20:32:19.1485743Z 2025-05-07T20:32:19.1485835Z if scale_ub is not None: 2025-05-07T20:32:19.1485949Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:19.1486089Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:19.1486167Z ) 2025-05-07T20:32:19.1486251Z else: 2025-05-07T20:32:19.1486347Z scale_ub_tensor = None 2025-05-07T20:32:19.1486421Z 2025-05-07T20:32:19.1486603Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:19.1486695Z op = silu_mul_quant 2025-05-07T20:32:19.1486787Z if compiled: 2025-05-07T20:32:19.1486888Z op = torch.compile(op) 2025-05-07T20:32:19.1486995Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:19.1487076Z 2025-05-07T20:32:19.1487170Z > y_fp8, y_scale = fn() 2025-05-07T20:32:19.1487175Z 2025-05-07T20:32:19.1487271Z moe/activation_test.py:117: 2025-05-07T20:32:19.1487414Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:19.1487515Z moe/activation_test.py:115: in fn 2025-05-07T20:32:19.1487616Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:19.1488067Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/torch/_dynamo/eval_frame.py:678: in _fn 2025-05-07T20:32:19.1488166Z return fn(*args, **kwargs) 
2025-05-07T20:32:19.1488668Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:32:19.1488773Z _fbgemm_silu_mul_quant[grid]( 2025-05-07T20:32:19.1489131Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:19.1489362Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:19.1489699Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:19.1489796Z kernel = self.compile( 2025-05-07T20:32:19.1490182Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:19.1490358Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:19.1490494Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:19.1490498Z 2025-05-07T20:32:19.1490702Z self = 2025-05-07T20:32:19.1491465Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:19.1491975Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7fce5a217380>} 2025-05-07T20:32:19.1494212Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:19.1494419Z context = 2025-05-07T20:32:19.1494424Z 2025-05-07T20:32:19.1494591Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:19.1494858Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:19.1494968Z module_map=module_map) 2025-05-07T20:32:19.1495132Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:19.1495239Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:32:19.1495318Z E ^ 2025-05-07T20:32:19.1495671Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:19.1495724Z 2025-05-07T20:32:19.1496139Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:19.1496143Z 2025-05-07T20:32:19.1496253Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:19.1496481Z self=, 2025-05-07T20:32:19.1496562Z T=1, 2025-05-07T20:32:19.1496641Z D=7168, 2025-05-07T20:32:19.1496733Z scale_ub=None, 2025-05-07T20:32:19.1496863Z contiguous=False, 2025-05-07T20:32:19.1496949Z compiled=True, 2025-05-07T20:32:19.1497030Z ) 2025-05-07T20:32:19.1497249Z self = 2025-05-07T20:32:19.1497414Z T = 1, D = 7168, scale_ub = None, contiguous = False, compiled = True 2025-05-07T20:32:19.1497426Z 2025-05-07T20:32:19.1497504Z @given( 2025-05-07T20:32:19.1497625Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:19.1497737Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:19.1497854Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:19.1497973Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:19.1498172Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:19.1498552Z ) 2025-05-07T20:32:19.1498861Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:19.1498966Z def test_silu_mul_quant( 2025-05-07T20:32:19.1499050Z self, 2025-05-07T20:32:19.1499137Z T: int, 2025-05-07T20:32:19.1499216Z D: int, 2025-05-07T20:32:19.1499316Z scale_ub: Optional[float], 2025-05-07T20:32:19.1499415Z contiguous: bool, 2025-05-07T20:32:19.1499505Z compiled: bool, 2025-05-07T20:32:19.1499585Z ) -> None: 2025-05-07T20:32:19.1499688Z torch.manual_seed(2025) 2025-05-07T20:32:19.1499763Z 2025-05-07T20:32:19.1499935Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:19.1500016Z 2025-05-07T20:32:19.1500109Z x_sign = torch.sign(x) 2025-05-07T20:32:19.1500236Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:19.1500331Z x = x_sign * x_clamp 2025-05-07T20:32:19.1500421Z x0 = x[:, :D] 2025-05-07T20:32:19.1500503Z x1 = x[:, D:] 2025-05-07T20:32:19.1500582Z 2025-05-07T20:32:19.1500669Z if contiguous: 2025-05-07T20:32:19.1500768Z x0 = x0.contiguous() 2025-05-07T20:32:19.1500861Z x1 = x1.contiguous() 2025-05-07T20:32:19.1500935Z 2025-05-07T20:32:19.1501038Z if scale_ub is not None: 2025-05-07T20:32:19.1501146Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:19.1501281Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:19.1501365Z ) 2025-05-07T20:32:19.1501443Z else: 2025-05-07T20:32:19.1501541Z scale_ub_tensor = None 2025-05-07T20:32:19.1501631Z 2025-05-07T20:32:19.1501765Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:19.1501856Z op = silu_mul_quant 2025-05-07T20:32:19.1501949Z if compiled: 2025-05-07T20:32:19.1502051Z op = torch.compile(op) 2025-05-07T20:32:19.1502171Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:19.1502246Z 2025-05-07T20:32:19.1502341Z y_fp8, y_scale = fn() 2025-05-07T20:32:19.1502470Z y = y_fp8.to(torch.float32) * y_scale[:, None] 2025-05-07T20:32:19.1502547Z 2025-05-07T20:32:19.1502684Z def ref_fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:19.1502797Z x0_fp32 = x0.to(torch.float32) 2025-05-07T20:32:19.1502899Z x1_fp32 = x1.to(torch.float32) 2025-05-07T20:32:19.1503022Z y = x0_fp32 * torch.sigmoid(x0_fp32) * x1_fp32 2025-05-07T20:32:19.1503170Z return triton_quantize_fp8_row(y, scale_ub_tensor) 2025-05-07T20:32:19.1503400Z 2025-05-07T20:32:19.1503502Z > y_fp8_ref, 
y_scale_ref = ref_fn() 2025-05-07T20:32:19.1503514Z 2025-05-07T20:32:19.1503613Z moe/activation_test.py:126: 2025-05-07T20:32:19.1503746Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:19.1503871Z moe/activation_test.py:124: in ref_fn 2025-05-07T20:32:19.1504006Z return triton_quantize_fp8_row(y, scale_ub_tensor) 2025-05-07T20:32:19.1504556Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/fbgemm_gpu/experimental/gemm/triton_gemm/fp8_gemm.py:2370: in triton_quantize_fp8_row 2025-05-07T20:32:19.1504778Z _kernel_quantize_fp8_row[grid]( 2025-05-07T20:32:19.1505136Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:19.1505366Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:19.1505729Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/autotuner.py:186: in run 2025-05-07T20:32:19.1505988Z timings = {config: self._bench(*args, config=config, **kwargs) for config in pruned_configs} 2025-05-07T20:32:19.1506482Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/autotuner.py:166: in _bench 2025-05-07T20:32:19.1506653Z return self.do_bench(kernel_call, quantiles=(0.5, 0.2, 0.8)) 2025-05-07T20:32:19.1506990Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/testing.py:117: in do_bench 2025-05-07T20:32:19.1507081Z fn() 2025-05-07T20:32:19.1507480Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/autotuner.py:152: in kernel_call 2025-05-07T20:32:19.1507569Z self.fn.run( 2025-05-07T20:32:19.1507905Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:19.1507999Z kernel = self.compile( 2025-05-07T20:32:19.1508382Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:19.1508560Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:19.1508692Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:19.1508708Z 2025-05-07T20:32:19.1508913Z self = 2025-05-07T20:32:19.1509679Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=2, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:19.1510188Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7fce5b1cc220>} 2025-05-07T20:32:19.1510919Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:19.1511120Z context = 2025-05-07T20:32:19.1511124Z 2025-05-07T20:32:19.1511294Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:19.1511554Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:19.1511671Z module_map=module_map) 2025-05-07T20:32:19.1511837Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:19.1511945Z E def _kernel_quantize_fp8_row( 2025-05-07T20:32:19.1512025Z E ^ 2025-05-07T20:32:19.1512376Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:19.1512381Z 2025-05-07T20:32:19.1512794Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:19.1512845Z 2025-05-07T20:32:19.1512950Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:19.1513173Z self=, 2025-05-07T20:32:19.1513259Z T=1, 2025-05-07T20:32:19.1513343Z D=5120, 2025-05-07T20:32:19.1513439Z scale_ub=1200.0, 2025-05-07T20:32:19.1513528Z contiguous=False, 2025-05-07T20:32:19.1513613Z compiled=True, 2025-05-07T20:32:19.1513693Z ) 2025-05-07T20:32:19.1513955Z self = 2025-05-07T20:32:19.1514119Z T = 1, D = 5120, scale_ub = 1200.0, contiguous = False, compiled = True 2025-05-07T20:32:19.1514124Z 2025-05-07T20:32:19.1514209Z @given( 2025-05-07T20:32:19.1514332Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:19.1514432Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:19.1514555Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:19.1514679Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:19.1514799Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:19.1514877Z ) 2025-05-07T20:32:19.1515222Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:19.1515327Z def test_silu_mul_quant( 2025-05-07T20:32:19.1515406Z self, 2025-05-07T20:32:19.1515487Z T: int, 2025-05-07T20:32:19.1515572Z D: int, 2025-05-07T20:32:19.1515675Z scale_ub: Optional[float], 2025-05-07T20:32:19.1515767Z contiguous: bool, 2025-05-07T20:32:19.1515860Z compiled: bool, 2025-05-07T20:32:19.1515940Z ) -> None: 2025-05-07T20:32:19.1516037Z torch.manual_seed(2025) 2025-05-07T20:32:19.1516119Z 2025-05-07T20:32:19.1516287Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:19.1516368Z 2025-05-07T20:32:19.1516462Z x_sign = torch.sign(x) 2025-05-07T20:32:19.1516596Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:19.1516696Z x = x_sign * x_clamp 2025-05-07T20:32:19.1516780Z x0 = x[:, :D] 2025-05-07T20:32:19.1516862Z x1 = x[:, D:] 2025-05-07T20:32:19.1516944Z 2025-05-07T20:32:19.1517034Z if contiguous: 2025-05-07T20:32:19.1517127Z x0 = x0.contiguous() 2025-05-07T20:32:19.1517229Z x1 = x1.contiguous() 2025-05-07T20:32:19.1517311Z 2025-05-07T20:32:19.1517404Z if scale_ub is not None: 2025-05-07T20:32:19.1517513Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:19.1517658Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:19.1517737Z ) 2025-05-07T20:32:19.1517814Z else: 2025-05-07T20:32:19.1517918Z scale_ub_tensor = None 2025-05-07T20:32:19.1517995Z 2025-05-07T20:32:19.1518129Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:19.1518227Z op = silu_mul_quant 2025-05-07T20:32:19.1518317Z if compiled: 2025-05-07T20:32:19.1518426Z op = torch.compile(op) 2025-05-07T20:32:19.1518533Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:19.1518607Z 2025-05-07T20:32:19.1518707Z > y_fp8, y_scale = fn() 2025-05-07T20:32:19.1518715Z 2025-05-07T20:32:19.1518814Z moe/activation_test.py:117: 2025-05-07T20:32:19.1518945Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:19.1519057Z moe/activation_test.py:115: in fn 2025-05-07T20:32:19.1519162Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:19.1519530Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/torch/_dynamo/eval_frame.py:678: in _fn 2025-05-07T20:32:19.1519626Z return fn(*args, **kwargs) 
2025-05-07T20:32:19.1520115Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:32:19.1520270Z _fbgemm_silu_mul_quant[grid]( 2025-05-07T20:32:19.1520625Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:19.1520845Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:19.1521194Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:19.1521290Z kernel = self.compile( 2025-05-07T20:32:19.1521675Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:19.1521889Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:19.1522018Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:19.1522023Z 2025-05-07T20:32:19.1522234Z self = 2025-05-07T20:32:19.1523000Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:19.1523586Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7fce5a9efb00>} 2025-05-07T20:32:19.1524320Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:19.1524516Z context = 2025-05-07T20:32:19.1524528Z 2025-05-07T20:32:19.1524693Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:19.1524952Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:19.1525069Z module_map=module_map) 2025-05-07T20:32:19.1525230Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:19.1525330Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:32:19.1525414Z E ^ 2025-05-07T20:32:19.1525767Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:19.1525772Z 2025-05-07T20:32:19.1526185Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:19.1526192Z 2025-05-07T20:32:19.1526296Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:19.1526517Z self=, 2025-05-07T20:32:19.1526601Z T=1, 2025-05-07T20:32:19.1526677Z D=5120, 2025-05-07T20:32:19.1526762Z scale_ub=1200.0, 2025-05-07T20:32:19.1526855Z contiguous=False, 2025-05-07T20:32:19.1526943Z compiled=False, 2025-05-07T20:32:19.1527017Z ) 2025-05-07T20:32:19.1527243Z self = 2025-05-07T20:32:19.1527408Z T = 1, D = 5120, scale_ub = 1200.0, contiguous = False, compiled = False 2025-05-07T20:32:19.1527413Z 2025-05-07T20:32:19.1527501Z @given( 2025-05-07T20:32:19.1527621Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:19.1527721Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:19.1527844Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:19.1527966Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:19.1528081Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:19.1528167Z ) 2025-05-07T20:32:19.1528409Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:19.1528504Z def test_silu_mul_quant( 2025-05-07T20:32:19.1528589Z self, 2025-05-07T20:32:19.1528667Z T: int, 2025-05-07T20:32:19.1528800Z D: int, 2025-05-07T20:32:19.1528901Z scale_ub: Optional[float], 2025-05-07T20:32:19.1528991Z contiguous: bool, 2025-05-07T20:32:19.1529085Z compiled: bool, 2025-05-07T20:32:19.1529167Z ) -> None: 2025-05-07T20:32:19.1529268Z torch.manual_seed(2025) 2025-05-07T20:32:19.1529350Z 2025-05-07T20:32:19.1529518Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:19.1529592Z 2025-05-07T20:32:19.1529691Z x_sign = torch.sign(x) 2025-05-07T20:32:19.1529859Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:19.1529949Z x = x_sign * x_clamp 2025-05-07T20:32:19.1530039Z x0 = x[:, :D] 2025-05-07T20:32:19.1530121Z x1 = x[:, D:] 2025-05-07T20:32:19.1530202Z 2025-05-07T20:32:19.1530287Z if contiguous: 2025-05-07T20:32:19.1530379Z x0 = x0.contiguous() 2025-05-07T20:32:19.1530476Z x1 = x1.contiguous() 2025-05-07T20:32:19.1530554Z 2025-05-07T20:32:19.1530647Z if scale_ub is not None: 2025-05-07T20:32:19.1530759Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:19.1530893Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:19.1530971Z ) 2025-05-07T20:32:19.1531127Z else: 2025-05-07T20:32:19.1531225Z scale_ub_tensor = None 2025-05-07T20:32:19.1531300Z 2025-05-07T20:32:19.1531436Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:19.1531527Z op = silu_mul_quant 2025-05-07T20:32:19.1531617Z if compiled: 2025-05-07T20:32:19.1531724Z op = torch.compile(op) 2025-05-07T20:32:19.1531833Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:19.1531920Z 2025-05-07T20:32:19.1532013Z > y_fp8, y_scale = fn() 2025-05-07T20:32:19.1532017Z 2025-05-07T20:32:19.1532115Z moe/activation_test.py:117: 2025-05-07T20:32:19.1532252Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:19.1532357Z moe/activation_test.py:115: in fn 2025-05-07T20:32:19.1532458Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:19.1532961Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:32:19.1533061Z 
    _fbgemm_silu_mul_quant[grid](
/home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/jit.py:330: in <lambda>
    return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs)
/home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/jit.py:623: in run
    kernel = self.compile(
/home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:273: in compile
    module = src.make_ir(options, codegen_fns, module_map, context)
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _

self = <...>
options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0, ...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True)
codegen_fns = {'convert_custom_types': <...>, 'min_dot_size': <... at 0x7fce5a9f56c0>}
module_map = {'triton.language.extra.libdevice': <...>}
context = <...>

    def make_ir(self, options, codegen_fns, module_map, context):
>       return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns,
                           module_map=module_map)
E       triton.compiler.errors.CompilationError: at 1:0:
E       def _fbgemm_silu_mul_quant(
E       ^
E       ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')")

/home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:100: CompilationError

Trying example: test_silu_mul_quant(
    self=<...>,
    T=16384,
    D=5120,
    scale_ub=1200.0,
    contiguous=False,
    compiled=True,
)
self = <...>
T = 16384, D = 5120, scale_ub = 1200.0, contiguous = False, compiled = True

    @given(
        T=st.sampled_from([1, 128, 2048, 4096, 16384]),
        D=st.sampled_from([5120, 7168]),
        scale_ub=st.sampled_from([None, 1200.00]),
        contiguous=st.sampled_from([True, False]),
        compiled=st.sampled_from([True, False]),
    )
    @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None)
    def test_silu_mul_quant(
        self,
        T: int,
        D: int,
        scale_ub: Optional[float],
        contiguous: bool,
        compiled: bool,
    ) -> None:
        torch.manual_seed(2025)

        x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16)

        x_sign = torch.sign(x)
        x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0)
        x = x_sign * x_clamp
        x0 = x[:, :D]
        x1 = x[:, D:]

        if contiguous:
            x0 = x0.contiguous()
            x1 = x1.contiguous()

        if scale_ub is not None:
            scale_ub_tensor = torch.tensor(
                [scale_ub], device="cuda", dtype=torch.float32
            )
        else:
            scale_ub_tensor = None

        def fn() -> Tuple[torch.Tensor, torch.Tensor]:
            op = silu_mul_quant
            if compiled:
                op = torch.compile(op)
            return op(x0, x1, scale_ub_tensor)

>       y_fp8, y_scale = fn()

moe/activation_test.py:117:
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _
moe/activation_test.py:115: in fn
    return op(x0, x1, scale_ub_tensor)
/home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/torch/_dynamo/eval_frame.py:678: in _fn
    return fn(*args, **kwargs)
/home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant
    _fbgemm_silu_mul_quant[grid](
/home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/jit.py:330: in <lambda>
    return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs)
/home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/jit.py:623: in run
    kernel = self.compile(
/home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:273: in compile
    module = src.make_ir(options, codegen_fns, module_map, context)
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _

self = <...>
options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0, ...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True)
codegen_fns = {'convert_custom_types': <...>, 'min_dot_size': <... at 0x7fce5a9f6fc0>}
module_map = {'triton.language.extra.libdevice': <...>}
context = <...>

    def make_ir(self, options, codegen_fns, module_map, context):
>       return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns,
                           module_map=module_map)
E       triton.compiler.errors.CompilationError: at 1:0:
E       def _fbgemm_silu_mul_quant(
E       ^
E       ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')")

/home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:100: CompilationError
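The failure is architectural rather than data-dependent: fp8e4nv is Triton's name for the float8_e4m3fn format, and Triton's NVIDIA backend only accepts it on GPUs with compute capability (8, 9) or newer (Ada/Hopper). The error above indicates the GPU on this runner reports an older capability. A minimal sketch of an up-front check, assuming only public PyTorch APIs (the helper name is illustrative, not part of FBGEMM):

import torch

def supports_fp8e4nv() -> bool:
    # Triton's NVIDIA backend accepts fp8e4nv (torch.float8_e4m3fn) only on
    # compute capability (8, 9) or newer; older GPUs raise the ValueError
    # seen throughout this log.
    if not torch.cuda.is_available():
        return False
    return torch.cuda.get_device_capability() >= (8, 9)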
Hypothesis went on to draw further examples, and every one failed with the identical test body, traceback, and CompilationError; only the drawn parameters differ:

Trying example: test_silu_mul_quant(T=2048, D=7168, scale_ub=1200.0, contiguous=False, compiled=True)
Trying example: test_silu_mul_quant(T=1, D=5120, scale_ub=None, contiguous=False, compiled=False)
Trying example: test_silu_mul_quant(T=4096, D=7168, scale_ub=1200.0, contiguous=False, compiled=False)
Trying example: test_silu_mul_quant(T=16384, D=7168, scale_ub=None, contiguous=True, compiled=True)
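With a check like that, the fp8 path could be skipped up front on unsupported hardware instead of letting every drawn example die in the Triton compiler. A sketch of one way to wire the skip into a unittest-style test such as the one above (the class name and decorator placement are illustrative, not FBGEMM's actual test file):

import unittest

import torch

def _sm89_or_newer() -> bool:
    # Same capability check as above: fp8e4nv needs an Ada/Hopper-class GPU.
    return torch.cuda.is_available() and torch.cuda.get_device_capability() >= (8, 9)

class ActivationTests(unittest.TestCase):  # illustrative stand-in
    # The skip sits outside the @given/@settings decorators, so unsupported
    # hardware skips once rather than failing every Hypothesis example.
    @unittest.skipUnless(_sm89_or_newer(), "fp8e4nv requires compute capability >= (8, 9)")
    def test_silu_mul_quant(self) -> None:
        ...  # the @given-driven body shown earlier

if __name__ == "__main__":
    unittest.main()

The log itself, meanwhile, keeps cycling through examples: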
Trying example: test_silu_mul_quant(T=4096, D=5120, scale_ub=None, contiguous=False, compiled=True)
Trying example: test_silu_mul_quant(T=4096, D=5120, scale_ub=1200.0, contiguous=False, compiled=False)
Trying example: test_silu_mul_quant(T=4096, D=5120, scale_ub=1200.0, contiguous=False, compiled=True)
Trying example: test_silu_mul_quant(T=2048, D=7168, scale_ub=1200.0, contiguous=False, compiled=False)
Trying example: test_silu_mul_quant(T=1, D=7168, scale_ub=None, contiguous=True, compiled=False)
Trying example: test_silu_mul_quant(T=16384, D=7168, scale_ub=1200.0, contiguous=False, compiled=True)
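For orientation while reading these dumps: judging from the test's inputs and outputs (bfloat16 halves x0 and x1, an optional float32 scale_ub tensor, and a (y_fp8, y_scale) result pair), silu_mul_quant fuses SiLU(x0) * x1 with rowwise fp8 quantization. A plain-PyTorch sketch of those assumed semantics follows; it is not FBGEMM's actual kernel, and the fp8 maximum and clamping details are assumptions:

import torch

FP8_MAX = torch.finfo(torch.float8_e4m3fn).max  # 448.0 for e4m3fn

def silu_mul_quant_ref(
    x0: torch.Tensor,
    x1: torch.Tensor,
    scale_ub: torch.Tensor | None = None,
) -> tuple[torch.Tensor, torch.Tensor]:
    # Activation in float32 for accuracy: SiLU(x0) * x1.
    y = torch.nn.functional.silu(x0.float()) * x1.float()
    # Rowwise dynamic range, optionally capped from above by scale_ub.
    row_max = y.abs().amax(dim=1, keepdim=True)
    if scale_ub is not None:
        row_max = torch.minimum(row_max, scale_ub)
    scale = row_max.clamp(min=1e-12) / FP8_MAX
    y_fp8 = (y / scale).clamp(-FP8_MAX, FP8_MAX).to(torch.float8_e4m3fn)
    return y_fp8, scale.squeeze(1)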
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:19.1689294Z 2025-05-07T20:32:19.1689703Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:19.1689707Z 2025-05-07T20:32:19.1689819Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:19.1690039Z self=, 2025-05-07T20:32:19.1690175Z T=1, 2025-05-07T20:32:19.1690253Z D=7168, 2025-05-07T20:32:19.1690335Z scale_ub=None, 2025-05-07T20:32:19.1690432Z contiguous=False, 2025-05-07T20:32:19.1690516Z compiled=False, 2025-05-07T20:32:19.1690588Z ) 2025-05-07T20:32:19.1690818Z self = 2025-05-07T20:32:19.1690983Z T = 1, D = 7168, scale_ub = None, contiguous = False, compiled = False 2025-05-07T20:32:19.1690987Z 2025-05-07T20:32:19.1691109Z @given( 2025-05-07T20:32:19.1691240Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:19.1691338Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:19.1691459Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:19.1691581Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:19.1691694Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:19.1691775Z ) 2025-05-07T20:32:19.1692021Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:19.1692115Z def test_silu_mul_quant( 2025-05-07T20:32:19.1692203Z self, 2025-05-07T20:32:19.1692282Z T: int, 2025-05-07T20:32:19.1692358Z D: int, 2025-05-07T20:32:19.1692544Z scale_ub: Optional[float], 2025-05-07T20:32:19.1692639Z contiguous: bool, 2025-05-07T20:32:19.1692723Z compiled: bool, 2025-05-07T20:32:19.1692810Z ) -> None: 2025-05-07T20:32:19.1692906Z torch.manual_seed(2025) 2025-05-07T20:32:19.1692989Z 2025-05-07T20:32:19.1693157Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:19.1693231Z 2025-05-07T20:32:19.1693331Z x_sign = torch.sign(x) 2025-05-07T20:32:19.1693456Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:19.1693545Z x = x_sign * x_clamp 2025-05-07T20:32:19.1693633Z x0 = x[:, :D] 2025-05-07T20:32:19.1693857Z x1 = x[:, D:] 2025-05-07T20:32:19.1693934Z 2025-05-07T20:32:19.1694024Z if contiguous: 2025-05-07T20:32:19.1694119Z x0 = x0.contiguous() 2025-05-07T20:32:19.1694208Z x1 = x1.contiguous() 2025-05-07T20:32:19.1694289Z 2025-05-07T20:32:19.1694381Z if scale_ub is not None: 2025-05-07T20:32:19.1694496Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:19.1694630Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:19.1694708Z ) 2025-05-07T20:32:19.1694794Z else: 2025-05-07T20:32:19.1694887Z scale_ub_tensor = None 2025-05-07T20:32:19.1694960Z 2025-05-07T20:32:19.1695096Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:19.1695185Z op = silu_mul_quant 2025-05-07T20:32:19.1695271Z if compiled: 2025-05-07T20:32:19.1695376Z op = torch.compile(op) 2025-05-07T20:32:19.1695481Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:19.1695557Z 2025-05-07T20:32:19.1695655Z > y_fp8, y_scale = fn() 2025-05-07T20:32:19.1695660Z 2025-05-07T20:32:19.1695756Z moe/activation_test.py:117: 2025-05-07T20:32:19.1695889Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:19.1695994Z moe/activation_test.py:115: in fn 2025-05-07T20:32:19.1696093Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:19.1696588Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:32:19.1696689Z 
_fbgemm_silu_mul_quant[grid]( 2025-05-07T20:32:19.1697043Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:19.1697267Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:19.1697602Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:19.1697756Z kernel = self.compile( 2025-05-07T20:32:19.1698134Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:19.1699396Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:19.1699745Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:19.1699752Z 2025-05-07T20:32:19.1699984Z self = 2025-05-07T20:32:19.1701932Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:19.1702937Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7fcda48e5760>} 2025-05-07T20:32:19.1704430Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:19.1705072Z context = 2025-05-07T20:32:19.1705083Z 2025-05-07T20:32:19.1705413Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:19.1705938Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:19.1706161Z module_map=module_map) 2025-05-07T20:32:19.1706485Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:19.1706690Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:32:19.1706845Z E ^ 2025-05-07T20:32:19.1707550Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:19.1707575Z 2025-05-07T20:32:19.1708391Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:19.1708401Z 2025-05-07T20:32:19.1708607Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:19.1709061Z self=, 2025-05-07T20:32:19.1709218Z T=2048, 2025-05-07T20:32:19.1709370Z D=7168, 2025-05-07T20:32:19.1709543Z scale_ub=None, 2025-05-07T20:32:19.1709717Z contiguous=False, 2025-05-07T20:32:19.1709890Z compiled=True, 2025-05-07T20:32:19.1710048Z ) 2025-05-07T20:32:19.1710384Z self = 2025-05-07T20:32:19.1710588Z T = 2048, D = 7168, scale_ub = None, contiguous = False, compiled = True 2025-05-07T20:32:19.1710592Z 2025-05-07T20:32:19.1710688Z @given( 2025-05-07T20:32:19.1710815Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:19.1710927Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:19.1711044Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:19.1711161Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:19.1711280Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:19.1711363Z ) 2025-05-07T20:32:19.1711613Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:19.1711708Z def test_silu_mul_quant( 2025-05-07T20:32:19.1711789Z self, 2025-05-07T20:32:19.1711878Z T: int, 2025-05-07T20:32:19.1711972Z D: int, 2025-05-07T20:32:19.1712071Z scale_ub: Optional[float], 2025-05-07T20:32:19.1712168Z contiguous: bool, 2025-05-07T20:32:19.1712254Z compiled: bool, 2025-05-07T20:32:19.1712336Z ) -> None: 2025-05-07T20:32:19.1712439Z torch.manual_seed(2025) 2025-05-07T20:32:19.1712512Z 2025-05-07T20:32:19.1712683Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:19.1712850Z 2025-05-07T20:32:19.1719113Z x_sign = torch.sign(x) 2025-05-07T20:32:19.1719274Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:19.1719370Z x = x_sign * x_clamp 2025-05-07T20:32:19.1719463Z x0 = x[:, :D] 2025-05-07T20:32:19.1719553Z x1 = x[:, D:] 2025-05-07T20:32:19.1719631Z 2025-05-07T20:32:19.1719727Z if contiguous: 2025-05-07T20:32:19.1719820Z x0 = x0.contiguous() 2025-05-07T20:32:19.1719910Z x1 = x1.contiguous() 2025-05-07T20:32:19.1720064Z 2025-05-07T20:32:19.1720156Z if scale_ub is not None: 2025-05-07T20:32:19.1720266Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:19.1720415Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:19.1720494Z ) 2025-05-07T20:32:19.1720572Z else: 2025-05-07T20:32:19.1720677Z scale_ub_tensor = None 2025-05-07T20:32:19.1720754Z 2025-05-07T20:32:19.1720895Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:19.1720994Z op = silu_mul_quant 2025-05-07T20:32:19.1721080Z if compiled: 2025-05-07T20:32:19.1721188Z op = torch.compile(op) 2025-05-07T20:32:19.1721377Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:19.1721453Z 2025-05-07T20:32:19.1721554Z > y_fp8, y_scale = fn() 2025-05-07T20:32:19.1721559Z 2025-05-07T20:32:19.1721658Z moe/activation_test.py:117: 2025-05-07T20:32:19.1721792Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:19.1721905Z moe/activation_test.py:115: in fn 2025-05-07T20:32:19.1722009Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:19.1722378Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/torch/_dynamo/eval_frame.py:678: in _fn 2025-05-07T20:32:19.1722481Z return fn(*args, **kwargs) 
2025-05-07T20:32:19.1722972Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:32:19.1724537Z _fbgemm_silu_mul_quant[grid]( 2025-05-07T20:32:19.1724894Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:19.1725119Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:19.1725462Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:19.1725561Z kernel = self.compile( 2025-05-07T20:32:19.1725949Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:19.1726124Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:19.1726255Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:19.1726260Z 2025-05-07T20:32:19.1726471Z self = 2025-05-07T20:32:19.1727248Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:19.1727757Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7fcda48e6f20>} 2025-05-07T20:32:19.1728497Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:19.1728692Z context = 2025-05-07T20:32:19.1728697Z 2025-05-07T20:32:19.1728868Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:19.1729176Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:19.1729294Z module_map=module_map) 2025-05-07T20:32:19.1729457Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:19.1729562Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:32:19.1729646Z E ^ 2025-05-07T20:32:19.1729996Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:19.1730041Z 2025-05-07T20:32:19.1730455Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:19.1730460Z 2025-05-07T20:32:19.1730564Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:19.1730784Z self=, 2025-05-07T20:32:19.1730871Z T=4096, 2025-05-07T20:32:19.1730949Z D=7168, 2025-05-07T20:32:19.1731035Z scale_ub=None, 2025-05-07T20:32:19.1731131Z contiguous=False, 2025-05-07T20:32:19.1731215Z compiled=True, 2025-05-07T20:32:19.1731292Z ) 2025-05-07T20:32:19.1731515Z self = 2025-05-07T20:32:19.1731768Z T = 4096, D = 7168, scale_ub = None, contiguous = False, compiled = True 2025-05-07T20:32:19.1731773Z 2025-05-07T20:32:19.1731860Z @given( 2025-05-07T20:32:19.1731980Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:19.1732085Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:19.1732210Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:19.1732332Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:19.1732446Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:19.1732532Z ) 2025-05-07T20:32:19.1732777Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:19.1732873Z def test_silu_mul_quant( 2025-05-07T20:32:19.1732966Z self, 2025-05-07T20:32:19.1733045Z T: int, 2025-05-07T20:32:19.1733133Z D: int, 2025-05-07T20:32:19.1733235Z scale_ub: Optional[float], 2025-05-07T20:32:19.1733325Z contiguous: bool, 2025-05-07T20:32:19.1733421Z compiled: bool, 2025-05-07T20:32:19.1733506Z ) -> None: 2025-05-07T20:32:19.1733603Z torch.manual_seed(2025) 2025-05-07T20:32:19.1733805Z 2025-05-07T20:32:19.1733974Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:19.1734064Z 2025-05-07T20:32:19.1734158Z x_sign = torch.sign(x) 2025-05-07T20:32:19.1734283Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:19.1734381Z x = x_sign * x_clamp 2025-05-07T20:32:19.1734461Z x0 = x[:, :D] 2025-05-07T20:32:19.1734542Z x1 = x[:, D:] 2025-05-07T20:32:19.1734622Z 2025-05-07T20:32:19.1734708Z if contiguous: 2025-05-07T20:32:19.1734800Z x0 = x0.contiguous() 2025-05-07T20:32:19.1734899Z x1 = x1.contiguous() 2025-05-07T20:32:19.1734972Z 2025-05-07T20:32:19.1735064Z if scale_ub is not None: 2025-05-07T20:32:19.1735177Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:19.1735315Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:19.1735394Z ) 2025-05-07T20:32:19.1735480Z else: 2025-05-07T20:32:19.1735575Z scale_ub_tensor = None 2025-05-07T20:32:19.1735656Z 2025-05-07T20:32:19.1735789Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:19.1735881Z op = silu_mul_quant 2025-05-07T20:32:19.1735972Z if compiled: 2025-05-07T20:32:19.1736072Z op = torch.compile(op) 2025-05-07T20:32:19.1736179Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:19.1736262Z 2025-05-07T20:32:19.1736353Z > y_fp8, y_scale = fn() 2025-05-07T20:32:19.1736358Z 2025-05-07T20:32:19.1736510Z moe/activation_test.py:117: 2025-05-07T20:32:19.1736649Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:19.1736749Z moe/activation_test.py:115: in fn 2025-05-07T20:32:19.1736856Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:19.1737224Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/torch/_dynamo/eval_frame.py:678: in _fn 2025-05-07T20:32:19.1737318Z return fn(*args, **kwargs) 
2025-05-07T20:32:19.1737815Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:32:19.1737959Z _fbgemm_silu_mul_quant[grid]( 2025-05-07T20:32:19.1738312Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:19.1738538Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:19.1738875Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:19.1738978Z kernel = self.compile( 2025-05-07T20:32:19.1739358Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:19.1739637Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:19.1739772Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:19.1739777Z 2025-05-07T20:32:19.1739982Z self = 2025-05-07T20:32:19.1740809Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:19.1741306Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7fcda4df00e0>} 2025-05-07T20:32:19.1742043Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:19.1742245Z context = 2025-05-07T20:32:19.1742250Z 2025-05-07T20:32:19.1742412Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:19.1742679Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:19.1742791Z module_map=module_map) 2025-05-07T20:32:19.1742954Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:19.1743060Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:32:19.1743138Z E ^ 2025-05-07T20:32:19.1743495Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:19.1743502Z 2025-05-07T20:32:19.1743907Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:19.1743912Z 2025-05-07T20:32:19.1744021Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:19.1744248Z self=, 2025-05-07T20:32:19.1744328Z T=16384, 2025-05-07T20:32:19.1744407Z D=5120, 2025-05-07T20:32:19.1744503Z scale_ub=1200.0, 2025-05-07T20:32:19.1744593Z contiguous=False, 2025-05-07T20:32:19.1744686Z compiled=False, 2025-05-07T20:32:19.1744766Z ) 2025-05-07T20:32:19.1744983Z self = 2025-05-07T20:32:19.1745169Z T = 16384, D = 5120, scale_ub = 1200.0, contiguous = False, compiled = False 2025-05-07T20:32:19.1745173Z 2025-05-07T20:32:19.1745253Z @given( 2025-05-07T20:32:19.1745377Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:19.1745531Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:19.1745649Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:19.1745769Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:19.1745893Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:19.1745969Z ) 2025-05-07T20:32:19.1746220Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:19.1746313Z def test_silu_mul_quant( 2025-05-07T20:32:19.1746431Z self, 2025-05-07T20:32:19.1746516Z T: int, 2025-05-07T20:32:19.1746593Z D: int, 2025-05-07T20:32:19.1746692Z scale_ub: Optional[float], 2025-05-07T20:32:19.1746794Z contiguous: bool, 2025-05-07T20:32:19.1746882Z compiled: bool, 2025-05-07T20:32:19.1746962Z ) -> None: 2025-05-07T20:32:19.1747064Z torch.manual_seed(2025) 2025-05-07T20:32:19.1747137Z 2025-05-07T20:32:19.1747311Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:19.1747393Z 2025-05-07T20:32:19.1747486Z x_sign = torch.sign(x) 2025-05-07T20:32:19.1747619Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:19.1747785Z x = x_sign * x_clamp 2025-05-07T20:32:19.1747867Z x0 = x[:, :D] 2025-05-07T20:32:19.1747961Z x1 = x[:, D:] 2025-05-07T20:32:19.1748035Z 2025-05-07T20:32:19.1748120Z if contiguous: 2025-05-07T20:32:19.1748219Z x0 = x0.contiguous() 2025-05-07T20:32:19.1748312Z x1 = x1.contiguous() 2025-05-07T20:32:19.1748386Z 2025-05-07T20:32:19.1748485Z if scale_ub is not None: 2025-05-07T20:32:19.1748593Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:19.1748727Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:19.1748811Z ) 2025-05-07T20:32:19.1748888Z else: 2025-05-07T20:32:19.1748983Z scale_ub_tensor = None 2025-05-07T20:32:19.1749068Z 2025-05-07T20:32:19.1749196Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:19.1749295Z op = silu_mul_quant 2025-05-07T20:32:19.1749380Z if compiled: 2025-05-07T20:32:19.1749486Z op = torch.compile(op) 2025-05-07T20:32:19.1749601Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:19.1749674Z 2025-05-07T20:32:19.1749765Z > y_fp8, y_scale = fn() 2025-05-07T20:32:19.1749770Z 2025-05-07T20:32:19.1749877Z moe/activation_test.py:117: 2025-05-07T20:32:19.1750008Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:19.1750108Z moe/activation_test.py:115: in fn 2025-05-07T20:32:19.1750215Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:19.1750704Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 
2025-05-07T20:32:19.1750813Z _fbgemm_silu_mul_quant[grid]( 2025-05-07T20:32:19.1751171Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:19.1751393Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:19.1751744Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:19.1751838Z kernel = self.compile( 2025-05-07T20:32:19.1752225Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:19.1752403Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:19.1752532Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:19.1752536Z 2025-05-07T20:32:19.1752751Z self = 2025-05-07T20:32:19.1753512Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:19.1754071Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7fcda4df0b80>} 2025-05-07T20:32:19.1754806Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:19.1755038Z context = 2025-05-07T20:32:19.1755043Z 2025-05-07T20:32:19.1755215Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:19.1755471Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:19.1755591Z module_map=module_map) 2025-05-07T20:32:19.1755753Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:19.1755851Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:32:19.1755937Z E ^ 2025-05-07T20:32:19.1756358Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:19.1756363Z 2025-05-07T20:32:19.1756769Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:19.1756783Z 2025-05-07T20:32:19.1756888Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:19.1757109Z self=, 2025-05-07T20:32:19.1757194Z T=16384, 2025-05-07T20:32:19.1757272Z D=5120, 2025-05-07T20:32:19.1757358Z scale_ub=1200.0, 2025-05-07T20:32:19.1757450Z contiguous=True, 2025-05-07T20:32:19.1757535Z compiled=True, 2025-05-07T20:32:19.1757612Z ) 2025-05-07T20:32:19.1757835Z self = 2025-05-07T20:32:19.1758005Z T = 16384, D = 5120, scale_ub = 1200.0, contiguous = True, compiled = True 2025-05-07T20:32:19.1758009Z 2025-05-07T20:32:19.1758091Z @given( 2025-05-07T20:32:19.1758216Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:19.1758315Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:19.1758438Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:19.1758558Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:19.1758674Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:19.1758756Z ) 2025-05-07T20:32:19.1758998Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:19.1759092Z def test_silu_mul_quant( 2025-05-07T20:32:19.1759174Z self, 2025-05-07T20:32:19.1759252Z T: int, 2025-05-07T20:32:19.1759334Z D: int, 2025-05-07T20:32:19.1759438Z scale_ub: Optional[float], 2025-05-07T20:32:19.1759529Z contiguous: bool, 2025-05-07T20:32:19.1759623Z compiled: bool, 2025-05-07T20:32:19.1759702Z ) -> None: 2025-05-07T20:32:19.1759797Z torch.manual_seed(2025) 2025-05-07T20:32:19.1759881Z 2025-05-07T20:32:19.1760050Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:19.1760125Z 2025-05-07T20:32:19.1760223Z x_sign = torch.sign(x) 2025-05-07T20:32:19.1760353Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:19.1760462Z x = x_sign * x_clamp 2025-05-07T20:32:19.1760562Z x0 = x[:, :D] 2025-05-07T20:32:19.1760661Z x1 = x[:, D:] 2025-05-07T20:32:19.1760737Z 2025-05-07T20:32:19.1760827Z if contiguous: 2025-05-07T20:32:19.1760919Z x0 = x0.contiguous() 2025-05-07T20:32:19.1761015Z x1 = x1.contiguous() 2025-05-07T20:32:19.1761087Z 2025-05-07T20:32:19.1761228Z if scale_ub is not None: 2025-05-07T20:32:19.1761339Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:19.1761473Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:19.1761550Z ) 2025-05-07T20:32:19.1761636Z else: 2025-05-07T20:32:19.1761733Z scale_ub_tensor = None 2025-05-07T20:32:19.1761807Z 2025-05-07T20:32:19.1761944Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:19.1762034Z op = silu_mul_quant 2025-05-07T20:32:19.1762164Z if compiled: 2025-05-07T20:32:19.1762274Z op = torch.compile(op) 2025-05-07T20:32:19.1762380Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:19.1762461Z 2025-05-07T20:32:19.1762553Z > y_fp8, y_scale = fn() 2025-05-07T20:32:19.1762557Z 2025-05-07T20:32:19.1762654Z moe/activation_test.py:117: 2025-05-07T20:32:19.1762790Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:19.1762894Z moe/activation_test.py:115: in fn 2025-05-07T20:32:19.1762997Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:19.1763366Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/torch/_dynamo/eval_frame.py:678: in _fn 2025-05-07T20:32:19.1763535Z return fn(*args, **kwargs) 
2025-05-07T20:32:19.1764030Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:32:19.1764128Z _fbgemm_silu_mul_quant[grid]( 2025-05-07T20:32:19.1764488Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:19.1764718Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:19.1765057Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:19.1765153Z kernel = self.compile( 2025-05-07T20:32:19.1765544Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:19.1765717Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:19.1765857Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:19.1765862Z 2025-05-07T20:32:19.1766066Z self = 2025-05-07T20:32:19.1766828Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:19.1767341Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7fcda4df22a0>} 2025-05-07T20:32:19.1768074Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:19.1768278Z context = 2025-05-07T20:32:19.1768282Z 2025-05-07T20:32:19.1768455Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:19.1768714Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:19.1768833Z module_map=module_map) 2025-05-07T20:32:19.1768998Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:19.1769104Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:32:19.1769186Z E ^ 2025-05-07T20:32:19.1769537Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:19.1769542Z 2025-05-07T20:32:19.1769955Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:19.1770004Z 2025-05-07T20:32:19.1770111Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:19.1770346Z self=, 2025-05-07T20:32:19.1770427Z T=16384, 2025-05-07T20:32:19.1770528Z D=5120, 2025-05-07T20:32:19.1770626Z scale_ub=None, 2025-05-07T20:32:19.1770735Z contiguous=False, 2025-05-07T20:32:19.1770821Z compiled=True, 2025-05-07T20:32:19.1770942Z ) 2025-05-07T20:32:19.1771159Z self = 2025-05-07T20:32:19.1771331Z T = 16384, D = 5120, scale_ub = None, contiguous = False, compiled = True 2025-05-07T20:32:19.1771336Z 2025-05-07T20:32:19.1771421Z @given( 2025-05-07T20:32:19.1771540Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:19.1771648Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:19.1771768Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:19.1771887Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:19.1772009Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:19.1772086Z ) 2025-05-07T20:32:19.1772428Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:19.1772531Z def test_silu_mul_quant( 2025-05-07T20:32:19.1772611Z self, 2025-05-07T20:32:19.1772689Z T: int, 2025-05-07T20:32:19.1772777Z D: int, 2025-05-07T20:32:19.1772875Z scale_ub: Optional[float], 2025-05-07T20:32:19.1772965Z contiguous: bool, 2025-05-07T20:32:19.1773058Z compiled: bool, 2025-05-07T20:32:19.1773136Z ) -> None: 2025-05-07T20:32:19.1773241Z torch.manual_seed(2025) 2025-05-07T20:32:19.1773313Z 2025-05-07T20:32:19.1773481Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:19.1773562Z 2025-05-07T20:32:19.1773726Z x_sign = torch.sign(x) 2025-05-07T20:32:19.1773852Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:19.1773947Z x = x_sign * x_clamp 2025-05-07T20:32:19.1774028Z x0 = x[:, :D] 2025-05-07T20:32:19.1774105Z x1 = x[:, D:] 2025-05-07T20:32:19.1774191Z 2025-05-07T20:32:19.1774272Z if contiguous: 2025-05-07T20:32:19.1774361Z x0 = x0.contiguous() 2025-05-07T20:32:19.1774455Z x1 = x1.contiguous() 2025-05-07T20:32:19.1774526Z 2025-05-07T20:32:19.1774626Z if scale_ub is not None: 2025-05-07T20:32:19.1774727Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:19.1774859Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:19.1774936Z ) 2025-05-07T20:32:19.1775009Z else: 2025-05-07T20:32:19.1775102Z scale_ub_tensor = None 2025-05-07T20:32:19.1775181Z 2025-05-07T20:32:19.1775311Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:19.1775402Z op = silu_mul_quant 2025-05-07T20:32:19.1775489Z if compiled: 2025-05-07T20:32:19.1775589Z op = torch.compile(op) 2025-05-07T20:32:19.1775694Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:19.1775776Z 2025-05-07T20:32:19.1775870Z > y_fp8, y_scale = fn() 2025-05-07T20:32:19.1775874Z 2025-05-07T20:32:19.1775975Z moe/activation_test.py:117: 2025-05-07T20:32:19.1776101Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:19.1776201Z moe/activation_test.py:115: in fn 2025-05-07T20:32:19.1776304Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:19.1776660Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/torch/_dynamo/eval_frame.py:678: in _fn 2025-05-07T20:32:19.1776752Z return fn(*args, **kwargs) 
2025-05-07T20:32:19.1777243Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:32:19.1777389Z _fbgemm_silu_mul_quant[grid]( 2025-05-07T20:32:19.1777749Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:19.1777970Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:19.1778303Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:19.1778402Z kernel = self.compile( 2025-05-07T20:32:19.1778822Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:19.1778994Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:19.1779127Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:19.1779131Z 2025-05-07T20:32:19.1779338Z self = 2025-05-07T20:32:19.1780101Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:19.1780721Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7fcda4df3060>} 2025-05-07T20:32:19.1781458Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:19.1781658Z context = 2025-05-07T20:32:19.1781662Z 2025-05-07T20:32:19.1781824Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:19.1782077Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:19.1782194Z module_map=module_map) 2025-05-07T20:32:19.1782353Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:19.1782450Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:32:19.1782537Z E ^ 2025-05-07T20:32:19.1782884Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:19.1782889Z 2025-05-07T20:32:19.1783299Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:19.1783306Z 2025-05-07T20:32:19.1783408Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:19.1783623Z self=, 2025-05-07T20:32:19.1783709Z T=2048, 2025-05-07T20:32:19.1783781Z D=5120, 2025-05-07T20:32:19.1783860Z scale_ub=None, 2025-05-07T20:32:19.1783953Z contiguous=False, 2025-05-07T20:32:19.1784035Z compiled=True, 2025-05-07T20:32:19.1784113Z ) 2025-05-07T20:32:19.1784327Z self = 2025-05-07T20:32:19.1784496Z T = 2048, D = 5120, scale_ub = None, contiguous = False, compiled = True 2025-05-07T20:32:19.1784506Z 2025-05-07T20:32:19.1784587Z @given( 2025-05-07T20:32:19.1784706Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:19.1784803Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:19.1784925Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:19.1785041Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:19.1785151Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:19.1785235Z ) 2025-05-07T20:32:19.1785475Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:19.1785572Z def test_silu_mul_quant( 2025-05-07T20:32:19.1785647Z self, 2025-05-07T20:32:19.1785769Z T: int, 2025-05-07T20:32:19.1785848Z D: int, 2025-05-07T20:32:19.1785946Z scale_ub: Optional[float], 2025-05-07T20:32:19.1786032Z contiguous: bool, 2025-05-07T20:32:19.1786122Z compiled: bool, 2025-05-07T20:32:19.1786197Z ) -> None: 2025-05-07T20:32:19.1786292Z torch.manual_seed(2025) 2025-05-07T20:32:19.1786369Z 2025-05-07T20:32:19.1786532Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:19.1786602Z 2025-05-07T20:32:19.1786736Z x_sign = torch.sign(x) 2025-05-07T20:32:19.1786856Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:19.1786948Z x = x_sign * x_clamp 2025-05-07T20:32:19.1787027Z x0 = x[:, :D] 2025-05-07T20:32:19.1787103Z x1 = x[:, D:] 2025-05-07T20:32:19.1787182Z 2025-05-07T20:32:19.1787262Z if contiguous: 2025-05-07T20:32:19.1787350Z x0 = x0.contiguous() 2025-05-07T20:32:19.1787442Z x1 = x1.contiguous() 2025-05-07T20:32:19.1787515Z 2025-05-07T20:32:19.1787602Z if scale_ub is not None: 2025-05-07T20:32:19.1787712Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:19.1787842Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:19.1787989Z ) 2025-05-07T20:32:19.1788074Z else: 2025-05-07T20:32:19.1788166Z scale_ub_tensor = None 2025-05-07T20:32:19.1788238Z 2025-05-07T20:32:19.1788373Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:19.1788465Z op = silu_mul_quant 2025-05-07T20:32:19.1788553Z if compiled: 2025-05-07T20:32:19.1788650Z op = torch.compile(op) 2025-05-07T20:32:19.1788754Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:19.1788833Z 2025-05-07T20:32:19.1788922Z > y_fp8, y_scale = fn() 2025-05-07T20:32:19.1788927Z 2025-05-07T20:32:19.1789020Z moe/activation_test.py:117: 2025-05-07T20:32:19.1789154Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:19.1789250Z moe/activation_test.py:115: in fn 2025-05-07T20:32:19.1789344Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:19.1789715Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/torch/_dynamo/eval_frame.py:678: in _fn 2025-05-07T20:32:19.1789807Z return fn(*args, **kwargs) 
2025-05-07T20:32:19.1790297Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:32:19.1790395Z _fbgemm_silu_mul_quant[grid]( 2025-05-07T20:32:19.1790795Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:19.1791017Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:19.1791350Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:19.1791450Z kernel = self.compile( 2025-05-07T20:32:19.1791826Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:19.1792000Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:19.1792138Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:19.1792142Z 2025-05-07T20:32:19.1792344Z self = 2025-05-07T20:32:19.1793112Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:19.1793607Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7fcda51507c0>} 2025-05-07T20:32:19.1794380Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:19.1794582Z context = 2025-05-07T20:32:19.1794587Z 2025-05-07T20:32:19.1794750Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:19.1795009Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:19.1795155Z module_map=module_map) 2025-05-07T20:32:19.1795312Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:19.1795415Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:32:19.1795493Z E ^ 2025-05-07T20:32:19.1795838Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:19.1795852Z 2025-05-07T20:32:19.1796256Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:19.1796261Z 2025-05-07T20:32:19.1796361Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:19.1796655Z self=, 2025-05-07T20:32:19.1796735Z T=2048, 2025-05-07T20:32:19.1796811Z D=5120, 2025-05-07T20:32:19.1796902Z scale_ub=1200.0, 2025-05-07T20:32:19.1796990Z contiguous=False, 2025-05-07T20:32:19.1797073Z compiled=True, 2025-05-07T20:32:19.1797154Z ) 2025-05-07T20:32:19.1797368Z self = 2025-05-07T20:32:19.1797544Z T = 2048, D = 5120, scale_ub = 1200.0, contiguous = False, compiled = True 2025-05-07T20:32:19.1797548Z 2025-05-07T20:32:19.1797622Z @given( 2025-05-07T20:32:19.1797743Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:19.1797849Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:19.1797961Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:19.1798075Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:19.1798423Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:19.1798541Z ) 2025-05-07T20:32:19.1798792Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:19.1798892Z def test_silu_mul_quant( 2025-05-07T20:32:19.1798968Z self, 2025-05-07T20:32:19.1799050Z T: int, 2025-05-07T20:32:19.1799123Z D: int, 2025-05-07T20:32:19.1799218Z scale_ub: Optional[float], 2025-05-07T20:32:19.1799314Z contiguous: bool, 2025-05-07T20:32:19.1799395Z compiled: bool, 2025-05-07T20:32:19.1799656Z ) -> None: 2025-05-07T20:32:19.1799756Z torch.manual_seed(2025) 2025-05-07T20:32:19.1799827Z 2025-05-07T20:32:19.1799992Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:19.1800072Z 2025-05-07T20:32:19.1800160Z x_sign = torch.sign(x) 2025-05-07T20:32:19.1800283Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:19.1800375Z x = x_sign * x_clamp 2025-05-07T20:32:19.1800451Z x0 = x[:, :D] 2025-05-07T20:32:19.1800537Z x1 = x[:, D:] 2025-05-07T20:32:19.1800609Z 2025-05-07T20:32:19.1800690Z if contiguous: 2025-05-07T20:32:19.1800783Z x0 = x0.contiguous() 2025-05-07T20:32:19.1800870Z x1 = x1.contiguous() 2025-05-07T20:32:19.1800940Z 2025-05-07T20:32:19.1801033Z if scale_ub is not None: 2025-05-07T20:32:19.1801137Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:19.1801265Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:19.1801345Z ) 2025-05-07T20:32:19.1801417Z else: 2025-05-07T20:32:19.1801507Z scale_ub_tensor = None 2025-05-07T20:32:19.1801581Z 2025-05-07T20:32:19.1801805Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:19.1801890Z op = silu_mul_quant 2025-05-07T20:32:19.1801977Z if compiled: 2025-05-07T20:32:19.1802073Z op = torch.compile(op) 2025-05-07T20:32:19.1802188Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:19.1802256Z 2025-05-07T20:32:19.1802345Z > y_fp8, y_scale = fn() 2025-05-07T20:32:19.1802350Z 2025-05-07T20:32:19.1802448Z moe/activation_test.py:117: 2025-05-07T20:32:19.1802665Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:19.1802760Z moe/activation_test.py:115: in fn 2025-05-07T20:32:19.1802864Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:19.1803223Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/torch/_dynamo/eval_frame.py:678: in _fn 2025-05-07T20:32:19.1803319Z return fn(*args, **kwargs) 
2025-05-07T20:32:19.1803802Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:32:19.1803901Z _fbgemm_silu_mul_quant[grid]( 2025-05-07T20:32:19.1804256Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:19.1804586Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:19.1804921Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:19.1805020Z kernel = self.compile( 2025-05-07T20:32:19.1805394Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:19.1805570Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:19.1805695Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:19.1805699Z 2025-05-07T20:32:19.1805902Z self = 2025-05-07T20:32:19.1806673Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:19.1807167Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7fcda5151580>} 2025-05-07T20:32:19.1807904Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:19.1808093Z context = 2025-05-07T20:32:19.1808098Z 2025-05-07T20:32:19.1808267Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:19.1808522Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:19.1808629Z module_map=module_map) 2025-05-07T20:32:19.1808792Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:19.1808892Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:32:19.1808970Z E ^ 2025-05-07T20:32:19.1809321Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:19.1809329Z 2025-05-07T20:32:19.1809733Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:19.1809738Z 2025-05-07T20:32:19.1809843Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:19.1810060Z self=, 2025-05-07T20:32:19.1810133Z T=4096, 2025-05-07T20:32:19.1810212Z D=5120, 2025-05-07T20:32:19.1810341Z scale_ub=1200.0, 2025-05-07T20:32:19.1810424Z contiguous=True, 2025-05-07T20:32:19.1810523Z compiled=True, 2025-05-07T20:32:19.1810607Z ) 2025-05-07T20:32:19.1810848Z self = 2025-05-07T20:32:19.1811024Z T = 4096, D = 5120, scale_ub = 1200.0, contiguous = True, compiled = True 2025-05-07T20:32:19.1811029Z 2025-05-07T20:32:19.1811103Z @given( 2025-05-07T20:32:19.1811226Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:19.1811364Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:19.1811479Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:19.1811600Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:19.1811709Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:19.1811780Z ) 2025-05-07T20:32:19.1812026Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:19.1812121Z def test_silu_mul_quant( 2025-05-07T20:32:19.1812197Z self, 2025-05-07T20:32:19.1812279Z T: int, 2025-05-07T20:32:19.1812354Z D: int, 2025-05-07T20:32:19.1812455Z scale_ub: Optional[float], 2025-05-07T20:32:19.1812541Z contiguous: bool, 2025-05-07T20:32:19.1812704Z compiled: bool, 2025-05-07T20:32:19.1812785Z ) -> None: 2025-05-07T20:32:19.1812875Z torch.manual_seed(2025) 2025-05-07T20:32:19.1812943Z 2025-05-07T20:32:19.1813113Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:19.1813188Z 2025-05-07T20:32:19.1813277Z x_sign = torch.sign(x) 2025-05-07T20:32:19.1813406Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:19.1813492Z x = x_sign * x_clamp 2025-05-07T20:32:19.1813569Z x0 = x[:, :D] 2025-05-07T20:32:19.1813735Z x1 = x[:, D:] 2025-05-07T20:32:19.1813807Z 2025-05-07T20:32:19.1813895Z if contiguous: 2025-05-07T20:32:19.1813989Z x0 = x0.contiguous() 2025-05-07T20:32:19.1814074Z x1 = x1.contiguous() 2025-05-07T20:32:19.1814156Z 2025-05-07T20:32:19.1814244Z if scale_ub is not None: 2025-05-07T20:32:19.1814347Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:19.1814490Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:19.1814562Z ) 2025-05-07T20:32:19.1814637Z else: 2025-05-07T20:32:19.1814733Z scale_ub_tensor = None 2025-05-07T20:32:19.1814803Z 2025-05-07T20:32:19.1814932Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:19.1815022Z op = silu_mul_quant 2025-05-07T20:32:19.1815104Z if compiled: 2025-05-07T20:32:19.1815201Z op = torch.compile(op) 2025-05-07T20:32:19.1815309Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:19.1815380Z 2025-05-07T20:32:19.1815474Z > y_fp8, y_scale = fn() 2025-05-07T20:32:19.1815479Z 2025-05-07T20:32:19.1815579Z moe/activation_test.py:117: 2025-05-07T20:32:19.1815704Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:19.1815807Z moe/activation_test.py:115: in fn 2025-05-07T20:32:19.1815903Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:19.1816267Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/torch/_dynamo/eval_frame.py:678: in _fn 2025-05-07T20:32:19.1816364Z return fn(*args, **kwargs) 
2025-05-07T20:32:19.1816847Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:32:19.1816948Z _fbgemm_silu_mul_quant[grid]( 2025-05-07T20:32:19.1817300Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:19.1817518Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:19.1817854Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:19.1817999Z kernel = self.compile( 2025-05-07T20:32:19.1818376Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:19.1818556Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:19.1818680Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:19.1818685Z 2025-05-07T20:32:19.1818937Z self = 2025-05-07T20:32:19.1819696Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:19.1820195Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7fcda5152840>} 2025-05-07T20:32:19.1821063Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:19.1821255Z context = 2025-05-07T20:32:19.1821260Z 2025-05-07T20:32:19.1821432Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:19.1821692Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:19.1821802Z module_map=module_map) 2025-05-07T20:32:19.1821963Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:19.1822061Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:32:19.1822145Z E ^ 2025-05-07T20:32:19.1822489Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:19.1822498Z 2025-05-07T20:32:19.1822901Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:19.1822912Z 2025-05-07T20:32:19.1823016Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:19.1823234Z self=, 2025-05-07T20:32:19.1823317Z T=128, 2025-05-07T20:32:19.1823393Z D=5120, 2025-05-07T20:32:19.1823478Z scale_ub=1200.0, 2025-05-07T20:32:19.1823567Z contiguous=False, 2025-05-07T20:32:19.1823651Z compiled=True, 2025-05-07T20:32:19.1823727Z ) 2025-05-07T20:32:19.1823948Z self = 2025-05-07T20:32:19.1824114Z T = 128, D = 5120, scale_ub = 1200.0, contiguous = False, compiled = True 2025-05-07T20:32:19.1824118Z 2025-05-07T20:32:19.1824199Z @given( 2025-05-07T20:32:19.1824319Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:19.1824416Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:19.1824537Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:19.1824653Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:19.1824766Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:19.1824845Z ) 2025-05-07T20:32:19.1825085Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:19.1825174Z def test_silu_mul_quant( 2025-05-07T20:32:19.1825258Z self, 2025-05-07T20:32:19.1825332Z T: int, 2025-05-07T20:32:19.1825405Z D: int, 2025-05-07T20:32:19.1825512Z scale_ub: Optional[float], 2025-05-07T20:32:19.1825599Z contiguous: bool, 2025-05-07T20:32:19.1825688Z compiled: bool, 2025-05-07T20:32:19.1825764Z ) -> None: 2025-05-07T20:32:19.1825856Z torch.manual_seed(2025) 2025-05-07T20:32:19.1825936Z 2025-05-07T20:32:19.1826151Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:19.1826225Z 2025-05-07T20:32:19.1826322Z x_sign = torch.sign(x) 2025-05-07T20:32:19.1826444Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:19.1826533Z x = x_sign * x_clamp 2025-05-07T20:32:19.1826619Z x0 = x[:, :D] 2025-05-07T20:32:19.1826695Z x1 = x[:, D:] 2025-05-07T20:32:19.1826767Z 2025-05-07T20:32:19.1826855Z if contiguous: 2025-05-07T20:32:19.1826984Z x0 = x0.contiguous() 2025-05-07T20:32:19.1827073Z x1 = x1.contiguous() 2025-05-07T20:32:19.1827157Z 2025-05-07T20:32:19.1827247Z if scale_ub is not None: 2025-05-07T20:32:19.1827355Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:19.1827485Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:19.1827561Z ) 2025-05-07T20:32:19.1827642Z else: 2025-05-07T20:32:19.1827737Z scale_ub_tensor = None 2025-05-07T20:32:19.1827806Z 2025-05-07T20:32:19.1827939Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:19.1828028Z op = silu_mul_quant 2025-05-07T20:32:19.1828112Z if compiled: 2025-05-07T20:32:19.1828296Z op = torch.compile(op) 2025-05-07T20:32:19.1828401Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:19.1828473Z 2025-05-07T20:32:19.1828568Z > y_fp8, y_scale = fn() 2025-05-07T20:32:19.1828573Z 2025-05-07T20:32:19.1828673Z moe/activation_test.py:117: 2025-05-07T20:32:19.1828806Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:19.1828904Z moe/activation_test.py:115: in fn 2025-05-07T20:32:19.1829003Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:19.1829367Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/torch/_dynamo/eval_frame.py:678: in _fn 2025-05-07T20:32:19.1829457Z return fn(*args, **kwargs) 
2025-05-07T20:32:19.1829942Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:32:19.1830043Z _fbgemm_silu_mul_quant[grid]( 2025-05-07T20:32:19.1830399Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:19.1830623Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:19.1830955Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:19.1831047Z kernel = self.compile( 2025-05-07T20:32:19.1831429Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:19.1831599Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:19.1831729Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:19.1831736Z 2025-05-07T20:32:19.1831939Z self = 2025-05-07T20:32:19.1832705Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:19.1833206Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7fcda51534c0>} 2025-05-07T20:32:19.1833936Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:19.1834133Z context = 2025-05-07T20:32:19.1834137Z 2025-05-07T20:32:19.1834371Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:19.1834626Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:19.1834736Z module_map=module_map) 2025-05-07T20:32:19.1834898Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:19.1834998Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:32:19.1835071Z E ^ 2025-05-07T20:32:19.1835419Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:19.1835464Z 2025-05-07T20:32:19.1835876Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:19.1835881Z 2025-05-07T20:32:19.1835984Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:19.1836208Z self=, 2025-05-07T20:32:19.1836289Z T=16384, 2025-05-07T20:32:19.1836365Z D=7168, 2025-05-07T20:32:19.1836452Z scale_ub=1200.0, 2025-05-07T20:32:19.1836535Z contiguous=True, 2025-05-07T20:32:19.1836617Z compiled=True, 2025-05-07T20:32:19.1836697Z ) 2025-05-07T20:32:19.1836983Z self = 2025-05-07T20:32:19.1837155Z T = 16384, D = 7168, scale_ub = 1200.0, contiguous = True, compiled = True 2025-05-07T20:32:19.1837160Z 2025-05-07T20:32:19.1837237Z @given( 2025-05-07T20:32:19.1837356Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:19.1837458Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:19.1837569Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:19.1837681Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:19.1837794Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:19.1837866Z ) 2025-05-07T20:32:19.1838105Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:19.1838202Z def test_silu_mul_quant( 2025-05-07T20:32:19.1838275Z self, 2025-05-07T20:32:19.1838349Z T: int, 2025-05-07T20:32:19.1838425Z D: int, 2025-05-07T20:32:19.1838520Z scale_ub: Optional[float], 2025-05-07T20:32:19.1838609Z contiguous: bool, 2025-05-07T20:32:19.1838696Z compiled: bool, 2025-05-07T20:32:19.1838770Z ) -> None: 2025-05-07T20:32:19.1838866Z torch.manual_seed(2025) 2025-05-07T20:32:19.1838935Z 2025-05-07T20:32:19.1839101Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:19.1839177Z 2025-05-07T20:32:19.1839269Z x_sign = torch.sign(x) 2025-05-07T20:32:19.1839389Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:19.1839479Z x = x_sign * x_clamp 2025-05-07T20:32:19.1839557Z x0 = x[:, :D] 2025-05-07T20:32:19.1839634Z x1 = x[:, D:] 2025-05-07T20:32:19.1839713Z 2025-05-07T20:32:19.1839796Z if contiguous: 2025-05-07T20:32:19.1839884Z x0 = x0.contiguous() 2025-05-07T20:32:19.1839972Z x1 = x1.contiguous() 2025-05-07T20:32:19.1844441Z 2025-05-07T20:32:19.1844559Z if scale_ub is not None: 2025-05-07T20:32:19.1844677Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:19.1844813Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:19.1844893Z ) 2025-05-07T20:32:19.1844969Z else: 2025-05-07T20:32:19.1845067Z scale_ub_tensor = None 2025-05-07T20:32:19.1845141Z 2025-05-07T20:32:19.1845272Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:19.1845366Z op = silu_mul_quant 2025-05-07T20:32:19.1845451Z if compiled: 2025-05-07T20:32:19.1845550Z op = torch.compile(op) 2025-05-07T20:32:19.1845658Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:19.1845730Z 2025-05-07T20:32:19.1845890Z > y_fp8, y_scale = fn() 2025-05-07T20:32:19.1845894Z 2025-05-07T20:32:19.1845996Z moe/activation_test.py:117: 2025-05-07T20:32:19.1846125Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:19.1846229Z moe/activation_test.py:115: in fn 2025-05-07T20:32:19.1846333Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:19.1846703Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/torch/_dynamo/eval_frame.py:678: in _fn 2025-05-07T20:32:19.1846800Z return fn(*args, **kwargs) 
Hypothesis keeps sampling, and the next five examples fail with the identical fp8e4nv CompilationError (tracebacks elided):

Trying example: test_silu_mul_quant(T=16384, D=5120, scale_ub=1200.0, contiguous=True,  compiled=False) -> CompilationError (fp8e4nv)
Trying example: test_silu_mul_quant(T=1,     D=7168, scale_ub=1200.0, contiguous=False, compiled=False) -> CompilationError (fp8e4nv)
Trying example: test_silu_mul_quant(T=4096,  D=7168, scale_ub=1200.0, contiguous=False, compiled=True)  -> CompilationError (fp8e4nv)
Trying example: test_silu_mul_quant(T=128,   D=7168, scale_ub=1200.0, contiguous=False, compiled=True)  -> CompilationError (fp8e4nv)
Trying example: test_silu_mul_quant(T=2048,  D=7168, scale_ub=None,   contiguous=True,  compiled=True)  -> CompilationError (fp8e4nv)

The failure mode then changes: with nearly all of the card's 22.07 GiB already allocated, even modest temporaries no longer fit.

Trying example: test_silu_mul_quant(
    self=<...>,
    T=16384,
    D=5120,
    scale_ub=None,
    contiguous=False,
    compiled=False,
)
self = <...>
T = 16384, D = 5120, scale_ub = None, contiguous = False, compiled = False

    [decorators and test body as in the listing above]
        torch.manual_seed(2025)

        x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16)

        x_sign = torch.sign(x)
>       x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0)
E       torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 320.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 140.44 MiB is free. Including non-PyTorch memory, this process has 21.92 GiB memory in use. Of the allocated memory 21.60 GiB is allocated by PyTorch, and 45.02 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables)

moe/activation_test.py:95: OutOfMemoryError
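The sizes the allocator reports line up exactly with one [T, 2*D] bfloat16 tensor at 2 bytes per element, so each failing line (randn, sign, abs, clamp) is simply the next full-size temporary; the pool appears exhausted by allocations accumulated earlier in the process rather than by any single tensor. A quick back-of-the-envelope check (assumptions: bf16 = 2 bytes/element, 1 MiB = 2**20 bytes):

    def tensor_mib(T: int, D: int, bytes_per_elem: int = 2) -> float:
        # Size of one [T, 2 * D] bf16 tensor, in MiB.
        return T * 2 * D * bytes_per_elem / 2**20

    assert tensor_mib(16384, 5120) == 320.0  # the 320.00 MiB failure above
    assert tensor_mib(16384, 7168) == 448.0  # the 448.00 MiB randn failure below
    assert tensor_mib(4096, 7168) == 112.0   # and the remaining OOM sizes
    assert tensor_mib(2048, 7168) == 56.0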
Subsequent examples hit the same allocator wall at whichever line first needs a fresh [T, 2*D] temporary (full tracebacks elided; only the failing line and requested size differ):

Trying example: test_silu_mul_quant(T=4096,  D=7168, scale_ub=1200.0, contiguous=True,  compiled=True)  -> OutOfMemoryError at x_clamp (moe/activation_test.py:95), tried to allocate 112.00 MiB
Trying example: test_silu_mul_quant(T=16384, D=7168, scale_ub=None,   contiguous=False, compiled=False) -> OutOfMemoryError at torch.randn (moe/activation_test.py:92), tried to allocate 448.00 MiB
Trying example: test_silu_mul_quant(T=2048,  D=7168, scale_ub=1200.0, contiguous=True,  compiled=True)  -> OutOfMemoryError at x_clamp (moe/activation_test.py:95), tried to allocate 56.00 MiB
Trying example: test_silu_mul_quant(T=2048,  D=7168, scale_ub=None,   contiguous=True,  compiled=False) -> OutOfMemoryError at x_sign (moe/activation_test.py:94), tried to allocate 56.00 MiB
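Each OutOfMemoryError message points at the same mitigation. One possible wiring, assuming it runs before CUDA is first initialized (for example in the job environment or a conftest.py; neither is something this workflow currently does):

    import gc
    import os

    # Must be set before the first CUDA allocation in the process.
    os.environ.setdefault("PYTORCH_CUDA_ALLOC_CONF", "expandable_segments:True")

    import torch

    def release_cuda_memory() -> None:
        # Drop dangling Python references, then hand cached blocks back to
        # the allocator so the next Hypothesis example starts from a clean pool.
        gc.collect()
        torch.cuda.empty_cache()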
_fbgemm_silu_mul_quant[grid]( 2025-05-07T20:32:19.1954440Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:19.1954657Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:19.1954995Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:19.1955088Z kernel = self.compile( 2025-05-07T20:32:19.1955469Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:19.1955677Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:19.1955804Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:19.1955809Z 2025-05-07T20:32:19.1956016Z self = 2025-05-07T20:32:19.1956779Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:19.1957281Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7fcda49e4b80>} 2025-05-07T20:32:19.1958017Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:19.1958272Z context = 2025-05-07T20:32:19.1958282Z 2025-05-07T20:32:19.1958442Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:19.1958703Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:19.1958814Z module_map=module_map) 2025-05-07T20:32:19.1958972Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:19.1959067Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:32:19.1959145Z E ^ 2025-05-07T20:32:19.1959490Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:19.1959497Z 2025-05-07T20:32:19.1959908Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:19.1959912Z 2025-05-07T20:32:19.1960018Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:19.1960237Z self=, 2025-05-07T20:32:19.1960318Z T=128, 2025-05-07T20:32:19.1960408Z D=5120, 2025-05-07T20:32:19.1960500Z scale_ub=None, 2025-05-07T20:32:19.1960605Z contiguous=True, 2025-05-07T20:32:19.1960693Z compiled=False, 2025-05-07T20:32:19.1960762Z ) 2025-05-07T20:32:19.1960981Z self = 2025-05-07T20:32:19.1961144Z T = 128, D = 5120, scale_ub = None, contiguous = True, compiled = False 2025-05-07T20:32:19.1961148Z 2025-05-07T20:32:19.1961226Z @given( 2025-05-07T20:32:19.1961386Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:19.1961484Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:19.1961599Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:19.1961713Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:19.1961830Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:19.1961906Z ) 2025-05-07T20:32:19.1962146Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:19.1962239Z def test_silu_mul_quant( 2025-05-07T20:32:19.1962357Z self, 2025-05-07T20:32:19.1962431Z T: int, 2025-05-07T20:32:19.1962508Z D: int, 2025-05-07T20:32:19.1962603Z scale_ub: Optional[float], 2025-05-07T20:32:19.1962689Z contiguous: bool, 2025-05-07T20:32:19.1962775Z compiled: bool, 2025-05-07T20:32:19.1962853Z ) -> None: 2025-05-07T20:32:19.1962945Z torch.manual_seed(2025) 2025-05-07T20:32:19.1963018Z 2025-05-07T20:32:19.1963183Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:19.1963254Z 2025-05-07T20:32:19.1963348Z x_sign = torch.sign(x) 2025-05-07T20:32:19.1963470Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:19.1963593Z x = x_sign * x_clamp 2025-05-07T20:32:19.1963677Z x0 = x[:, :D] 2025-05-07T20:32:19.1963757Z x1 = x[:, D:] 2025-05-07T20:32:19.1963831Z 2025-05-07T20:32:19.1963912Z if contiguous: 2025-05-07T20:32:19.1964000Z x0 = x0.contiguous() 2025-05-07T20:32:19.1964095Z x1 = x1.contiguous() 2025-05-07T20:32:19.1964169Z 2025-05-07T20:32:19.1964256Z if scale_ub is not None: 2025-05-07T20:32:19.1964361Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:19.1964491Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:19.1964564Z ) 2025-05-07T20:32:19.1964642Z else: 2025-05-07T20:32:19.1964734Z scale_ub_tensor = None 2025-05-07T20:32:19.1964807Z 2025-05-07T20:32:19.1964936Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:19.1969179Z op = silu_mul_quant 2025-05-07T20:32:19.1969287Z if compiled: 2025-05-07T20:32:19.1969395Z op = torch.compile(op) 2025-05-07T20:32:19.1969577Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:19.1969649Z 2025-05-07T20:32:19.1969738Z > y_fp8, y_scale = fn() 2025-05-07T20:32:19.1969743Z 2025-05-07T20:32:19.1969843Z moe/activation_test.py:117: 2025-05-07T20:32:19.1969972Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:19.1970070Z moe/activation_test.py:115: in fn 2025-05-07T20:32:19.1970171Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:19.1970668Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:32:19.1970767Z 
_fbgemm_silu_mul_quant[grid]( 2025-05-07T20:32:19.1971122Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:19.1971339Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:19.1971680Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:19.1971775Z kernel = self.compile( 2025-05-07T20:32:19.1972150Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:19.1972327Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:19.1972453Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:19.1972457Z 2025-05-07T20:32:19.1972669Z self = 2025-05-07T20:32:19.1973433Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:19.1974091Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7fcda49e5a80>} 2025-05-07T20:32:19.1974827Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:19.1975060Z context = 2025-05-07T20:32:19.1975064Z 2025-05-07T20:32:19.1975229Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:19.1975486Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:19.1975599Z module_map=module_map) 2025-05-07T20:32:19.1975760Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:19.1975856Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:32:19.1975934Z E ^ 2025-05-07T20:32:19.1976326Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:19.1976332Z 2025-05-07T20:32:19.1976736Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:19.1976743Z 2025-05-07T20:32:19.1976846Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:19.1977063Z self=, 2025-05-07T20:32:19.1977140Z T=128, 2025-05-07T20:32:19.1977216Z D=7168, 2025-05-07T20:32:19.1977295Z scale_ub=None, 2025-05-07T20:32:19.1977379Z contiguous=True, 2025-05-07T20:32:19.1977460Z compiled=False, 2025-05-07T20:32:19.1977533Z ) 2025-05-07T20:32:19.1977749Z self = 2025-05-07T20:32:19.1977912Z T = 128, D = 7168, scale_ub = None, contiguous = True, compiled = False 2025-05-07T20:32:19.1977917Z 2025-05-07T20:32:19.1977991Z @given( 2025-05-07T20:32:19.1978113Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:19.1978255Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:19.1978371Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:19.1978493Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:19.1978604Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:19.1978681Z ) 2025-05-07T20:32:19.1978920Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:19.1979016Z def test_silu_mul_quant( 2025-05-07T20:32:19.1979094Z self, 2025-05-07T20:32:19.1979170Z T: int, 2025-05-07T20:32:19.1979251Z D: int, 2025-05-07T20:32:19.1979350Z scale_ub: Optional[float], 2025-05-07T20:32:19.1979439Z contiguous: bool, 2025-05-07T20:32:19.1979523Z compiled: bool, 2025-05-07T20:32:19.1979605Z ) -> None: 2025-05-07T20:32:19.1979697Z torch.manual_seed(2025) 2025-05-07T20:32:19.1979770Z 2025-05-07T20:32:19.1979944Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:19.1980016Z 2025-05-07T20:32:19.1980114Z x_sign = torch.sign(x) 2025-05-07T20:32:19.1980242Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:19.1980328Z x = x_sign * x_clamp 2025-05-07T20:32:19.1980409Z x0 = x[:, :D] 2025-05-07T20:32:19.1980487Z x1 = x[:, D:] 2025-05-07T20:32:19.1980557Z 2025-05-07T20:32:19.1980643Z if contiguous: 2025-05-07T20:32:19.1980733Z x0 = x0.contiguous() 2025-05-07T20:32:19.1980819Z x1 = x1.contiguous() 2025-05-07T20:32:19.1980891Z 2025-05-07T20:32:19.1981028Z if scale_ub is not None: 2025-05-07T20:32:19.1981134Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:19.1981271Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:19.1981344Z ) 2025-05-07T20:32:19.1981424Z else: 2025-05-07T20:32:19.1981520Z scale_ub_tensor = None 2025-05-07T20:32:19.1981591Z 2025-05-07T20:32:19.1981720Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:19.1981812Z op = silu_mul_quant 2025-05-07T20:32:19.1981941Z if compiled: 2025-05-07T20:32:19.1982039Z op = torch.compile(op) 2025-05-07T20:32:19.1982141Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:19.1982215Z 2025-05-07T20:32:19.1982303Z > y_fp8, y_scale = fn() 2025-05-07T20:32:19.1982307Z 2025-05-07T20:32:19.1982405Z moe/activation_test.py:117: 2025-05-07T20:32:19.1982530Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:19.1982630Z moe/activation_test.py:115: in fn 2025-05-07T20:32:19.1982729Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:19.1983215Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:32:19.1983351Z 
_fbgemm_silu_mul_quant[grid]( 2025-05-07T20:32:19.1983713Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:19.1983929Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:19.1984270Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:19.1984360Z kernel = self.compile( 2025-05-07T20:32:19.1984739Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:19.1984914Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:19.1985042Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:19.1985046Z 2025-05-07T20:32:19.1985251Z self = 2025-05-07T20:32:19.1986058Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:19.1986558Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7fcda49e6980>} 2025-05-07T20:32:19.1987289Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:19.1987483Z context = 2025-05-07T20:32:19.1987488Z 2025-05-07T20:32:19.1987651Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:19.1987909Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:19.1988017Z module_map=module_map) 2025-05-07T20:32:19.1988179Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:19.1988276Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:32:19.1988353Z E ^ 2025-05-07T20:32:19.1988704Z E ValueError("type fp8e4nv not supported in this architecture. 
2025-05-07T20:32:19.1989215Z Trying example: test_silu_mul_quant(T=2048, D=7168, scale_ub=1200.0, contiguous=True, compiled=False) -> torch.OutOfMemoryError at moe/activation_test.py:92 (x = torch.randn(...)): tried to allocate 56.00 MiB, GPU 0 has 26.44 MiB free, 21.69 GiB allocated by PyTorch (see https://pytorch.org/docs/stable/notes/cuda.html#environment-variables)
2025-05-07T20:32:19.1994406Z Trying example: test_silu_mul_quant(T=1, D=5120, scale_ub=1200.0, contiguous=True, compiled=False) -> triton.compiler.errors.CompilationError at moe/activation_test.py:117 in _fbgemm_silu_mul_quant: ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')")
2025-05-07T20:32:19.2007270Z Trying example: test_silu_mul_quant(T=2048, D=5120, scale_ub=None, contiguous=True, compiled=False) -> torch.OutOfMemoryError at moe/activation_test.py:94 (x_sign = torch.sign(x)): tried to allocate 40.00 MiB, 26.44 MiB free, 21.73 GiB allocated by PyTorch
2025-05-07T20:32:19.2012553Z Trying example: test_silu_mul_quant(T=16384, D=5120, scale_ub=None, contiguous=True, compiled=False) -> torch.OutOfMemoryError at moe/activation_test.py:92: tried to allocate 320.00 MiB, 26.44 MiB free, 21.73 GiB allocated by PyTorch
2025-05-07T20:32:19.2017708Z Trying example: test_silu_mul_quant(T=4096, D=5120, scale_ub=None, contiguous=True, compiled=False) -> torch.OutOfMemoryError at moe/activation_test.py:92: tried to allocate 80.00 MiB, 26.44 MiB free, 21.73 GiB allocated by PyTorch
2025-05-07T20:32:19.2022809Z Trying example: test_silu_mul_quant(T=2048, D=5120, scale_ub=None, contiguous=False, compiled=False) -> torch.OutOfMemoryError at moe/activation_test.py:92: tried to allocate 40.00 MiB, 26.44 MiB free, 21.73 GiB allocated by PyTorch
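Note: the requested sizes in these OutOfMemoryError lines are exactly the bf16 input tensor x of shape [T, 2 * D]: T * 2D elements at 2 bytes each. A quick check of the arithmetic against the sizes reported above and below:

    # Each failed allocation matches x = torch.randn([T, 2 * D], dtype=torch.bfloat16):
    # bytes = T * (2 * D) * 2, since bfloat16 is 2 bytes per element.
    for T, D in [(2048, 5120), (4096, 5120), (16384, 5120),
                 (2048, 7168), (4096, 7168), (16384, 7168)]:
        mib = T * (2 * D) * 2 / 2**20
        print(f"T={T:<6} D={D:<5} -> {mib:.2f} MiB")
    # -> 40.00, 80.00, 320.00, 56.00, 112.00, 448.00 MiB, matching the log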
2025-05-07T20:32:19.2027848Z Trying example: test_silu_mul_quant(T=4096, D=7168, scale_ub=None, contiguous=True, compiled=True) -> torch.OutOfMemoryError at moe/activation_test.py:92: tried to allocate 112.00 MiB, 26.44 MiB free, 21.73 GiB allocated by PyTorch
2025-05-07T20:32:19.2032980Z Trying example: test_silu_mul_quant(T=2048, D=5120, scale_ub=1200.0, contiguous=False, compiled=False) -> torch.OutOfMemoryError at moe/activation_test.py:92: tried to allocate 40.00 MiB, 26.44 MiB free, 21.73 GiB allocated by PyTorch
2025-05-07T20:32:19.2038023Z Trying example: test_silu_mul_quant(T=4096, D=7168, scale_ub=1200.0, contiguous=True, compiled=False) -> torch.OutOfMemoryError at moe/activation_test.py:92: tried to allocate 112.00 MiB, 26.44 MiB free, 21.73 GiB allocated by PyTorch
2025-05-07T20:32:19.2043114Z Trying example: test_silu_mul_quant(T=16384, D=7168, scale_ub=None, contiguous=False, compiled=True) -> torch.OutOfMemoryError at moe/activation_test.py:92: tried to allocate 448.00 MiB, 26.44 MiB free, 21.73 GiB allocated by PyTorch
2025-05-07T20:32:19.2048148Z Trying example: test_silu_mul_quant(T=4096, D=7168, scale_ub=None, contiguous=True, compiled=False) -> torch.OutOfMemoryError at moe/activation_test.py:92: tried to allocate 112.00 MiB, 26.44 MiB free, 21.73 GiB allocated by PyTorch
2025-05-07T20:32:19.2053159Z Trying example: test_silu_mul_quant(T=16384, D=7168, scale_ub=None, contiguous=True, compiled=False) -> torch.OutOfMemoryError at moe/activation_test.py:92: tried to allocate 448.00 MiB, 26.44 MiB free, 21.73 GiB allocated by PyTorch
2025-05-07T20:32:19.2058314Z Trying example: test_silu_mul_quant(T=16384, D=7168, scale_ub=1200.0, contiguous=True, compiled=False) -> torch.OutOfMemoryError at moe/activation_test.py:92: tried to allocate 448.00 MiB, 26.44 MiB free, 21.73 GiB allocated by PyTorch
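Note: the free-memory figure stays pinned near 26 MiB while example after example fails, which points at the allocator holding earlier examples' tensors rather than any single request being unreasonable. The error text itself suggests one knob; a sketch of how that knob plus an explicit cache flush between Hypothesis examples might be applied (where the cleanup hook is invoked is an assumption, not FBGEMM's actual test code):

    import os

    # Must be set before the first CUDA allocation in the process, e.g. in the
    # CI job's environment rather than inside the test body.
    os.environ.setdefault("PYTORCH_CUDA_ALLOC_CONF", "expandable_segments:True")

    import torch

    def release_cuda_cache() -> None:
        # Hand cached blocks back to the driver between examples so one failed
        # example's bf16 inputs do not starve the next one.
        if torch.cuda.is_available():
            torch.cuda.synchronize()
            torch.cuda.empty_cache()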
2025-05-07T20:32:19.2063440Z Trying example: test_silu_mul_quant(T=128, D=5120, scale_ub=1200.0, contiguous=False, compiled=False) -> triton.compiler.errors.CompilationError at moe/activation_test.py:117 in _fbgemm_silu_mul_quant: ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')")
2025-05-07T20:32:19.2075772Z Trying example: test_silu_mul_quant(T=2048, D=7168, scale_ub=None, contiguous=False, compiled=False) -> torch.OutOfMemoryError at moe/activation_test.py:92: tried to allocate 56.00 MiB, 26.44 MiB free, 21.74 GiB allocated by PyTorch
2025-05-07T20:32:19.2080894Z Trying example: test_silu_mul_quant(T=128, D=7168, scale_ub=1200.0, contiguous=True, compiled=True) -> triton.compiler.errors.CompilationError at moe/activation_test.py:117, reached through torch/_dynamo/eval_frame.py:678 into _fbgemm_silu_mul_quant: ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')")
2025-05-07T20:32:19.2097837Z Trying example: test_silu_mul_quant(T=128, D=7168, scale_ub=1200.0, contiguous=True, compiled=False) -> torch.OutOfMemoryError at moe/activation_test.py:95 (x_clamp = torch.clamp(...)): tried to allocate 20.00 MiB, 4.44 MiB free, 21.77 GiB allocated by PyTorch
2025-05-07T20:32:19.2103582Z Trying example: test_silu_mul_quant(T=128, D=5120, scale_ub=1200.0, contiguous=True, compiled=True) -> torch.OutOfMemoryError at moe/activation_test.py:95 (x_clamp = torch.clamp(...)): tried to allocate 20.00 MiB, 4.44 MiB free, 21.77 GiB allocated by PyTorch
2025-05-07T20:32:19.2108903Z Trying example: test_silu_mul_quant(T=128, D=7168, scale_ub=None, contiguous=True, compiled=True) -> torch.OutOfMemoryError at moe/activation_test.py:92: tried to allocate 20.00 MiB, 4.44 MiB free, 21.77 GiB allocated by PyTorch
2025-05-07T20:32:19.2113949Z =============================== warnings summary ===============================
2025-05-07T20:32:19.2114368Z ../../../../../../../../miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/autotuner.py:108
2025-05-07T20:32:19.2114699Z ../../../../../../../../miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/autotuner.py:108
2025-05-07T20:32:19.2115046Z ../../../../../../../../miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/autotuner.py:108
2025-05-07T20:32:19.2115904Z   /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/autotuner.py:108: DeprecationWarning: warmup, rep, and use_cuda_graph parameters are deprecated. See https://github.com/triton-lang/triton/pull/4496 for details.
2025-05-07T20:32:19.2116130Z     warnings.warn(("warmup, rep, and use_cuda_graph parameters are deprecated. See "
2025-05-07T20:32:19.2116138Z 
2025-05-07T20:32:19.2116343Z -- Docs: https://docs.pytest.org/en/stable/how-to/capture-warnings.html
2025-05-07T20:32:19.2116507Z ================= 1 failed, 1 deselected, 3 warnings in 13.83s =================
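Note: the DeprecationWarning above fires three times from triton/runtime/autotuner.py:108; per the linked Triton PR, the warmup/rep/use_cuda_graph knobs are deprecated in favor of the do_bench interface. A hedged sketch of timing a CUDA workload directly with triton.testing.do_bench, using the same quantiles the autotuner passes in the retried run's traceback below; the workload is a stand-in, not one of the kernels in this log:

    import torch
    import triton.testing

    def workload() -> torch.Tensor:
        # Stand-in for a kernel launch; any CUDA callable can be timed this way.
        a = torch.randn(1024, 1024, device="cuda")
        return a @ a

    # Median, 20th- and 80th-percentile latencies in ms, matching
    # quantiles=(0.5, 0.2, 0.8) as used by the autotuner.
    med, p20, p80 = triton.testing.do_bench(workload, quantiles=(0.5, 0.2, 0.8))
    print(f"median {med:.3f} ms (p20 {p20:.3f}, p80 {p80:.3f})")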
See " 2025-05-07T20:32:19.2116138Z 2025-05-07T20:32:19.2116343Z -- Docs: https://docs.pytest.org/en/stable/how-to/capture-warnings.html 2025-05-07T20:32:19.2116507Z ================= 1 failed, 1 deselected, 3 warnings in 13.83s ================= 2025-05-07T20:32:20.8856322Z ERROR conda.cli.main_run:execute(125): `conda run python -m pytest -v -rsx -s -W ignore::pytest.PytestCollectionWarning --lf --last-failed-no-failures none ./moe/activation_test.py` failed. (See above for error) 2025-05-07T20:32:20.9487952Z [EXEC] [ATTEMPT 0/2] Command attempt failed. 2025-05-07T20:32:20.9488184Z 2025-05-07T20:32:22.9508957Z [EXEC] [ATTEMPT 1/2] + conda run --no-capture-output -n build_binary python -m pytest -v -rsx -s -W ignore::pytest.PytestCollectionWarning --lf --last-failed-no-failures none ./moe/activation_test.py 2025-05-07T20:32:25.1300737Z ============================= test session starts ============================== 2025-05-07T20:32:25.1301961Z platform linux -- Python 3.13.0, pytest-8.3.5, pluggy-1.5.0 -- /home/ec2-user/miniconda/envs/build_binary/bin/python 2025-05-07T20:32:25.1302791Z cachedir: .pytest_cache 2025-05-07T20:32:25.1303378Z hypothesis profile 'ci' -> database=None, deadline=None, print_blob=True, derandomize=True, suppress_health_check=(HealthCheck.too_slow,) 2025-05-07T20:32:25.1304358Z rootdir: /home/ec2-user/actions-runner/_work/FBGEMM/FBGEMM/fbgemm_gpu 2025-05-07T20:32:25.1304760Z plugins: hypothesis-6.131.14 2025-05-07T20:32:26.6740301Z TMA benchmarks will be running with experimental grid constant TMA descriptor. 2025-05-07T20:32:26.7699404Z collecting ... collected 2 items / 1 deselected / 1 selected 2025-05-07T20:32:26.7699812Z run-last-failure: rerun previous 1 failure 2025-05-07T20:32:26.7700038Z 2025-05-07T20:32:28.8743779Z moe/activation_test.py::ActivationTests::test_silu_mul_quant Trying example: test_silu_mul_quant( 2025-05-07T20:32:28.8744481Z self=, 2025-05-07T20:32:28.8744914Z T=1, 2025-05-07T20:32:28.8745113Z D=5120, 2025-05-07T20:32:28.8745305Z scale_ub=None, 2025-05-07T20:32:28.8745526Z contiguous=True, 2025-05-07T20:32:28.8745755Z compiled=True, 2025-05-07T20:32:28.8745958Z ) 2025-05-07T20:32:28.8746286Z self = 2025-05-07T20:32:28.8746784Z T = 1, D = 5120, scale_ub = None, contiguous = True, compiled = True 2025-05-07T20:32:28.8747037Z 2025-05-07T20:32:28.8747127Z @given( 2025-05-07T20:32:28.8747360Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:28.8747677Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:28.8747981Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:28.8748303Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:28.8748629Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:28.8748920Z ) 2025-05-07T20:32:28.8749261Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:28.8750008Z def test_silu_mul_quant( 2025-05-07T20:32:28.8750254Z self, 2025-05-07T20:32:28.8750453Z T: int, 2025-05-07T20:32:28.8750648Z D: int, 2025-05-07T20:32:28.8750868Z scale_ub: Optional[float], 2025-05-07T20:32:28.8751143Z contiguous: bool, 2025-05-07T20:32:28.8751382Z compiled: bool, 2025-05-07T20:32:28.8751613Z ) -> None: 2025-05-07T20:32:28.8751835Z torch.manual_seed(2025) 2025-05-07T20:32:28.8752183Z 2025-05-07T20:32:28.8752459Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:28.8752799Z 2025-05-07T20:32:28.8752987Z x_sign = torch.sign(x) 2025-05-07T20:32:28.8753329Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 
2025-05-07T20:32:28.8753642Z x = x_sign * x_clamp 2025-05-07T20:32:28.8753880Z x0 = x[:, :D] 2025-05-07T20:32:28.8754102Z x1 = x[:, D:] 2025-05-07T20:32:28.8754322Z 2025-05-07T20:32:28.8754508Z if contiguous: 2025-05-07T20:32:28.8754741Z x0 = x0.contiguous() 2025-05-07T20:32:28.8755000Z x1 = x1.contiguous() 2025-05-07T20:32:28.8755235Z 2025-05-07T20:32:28.8755433Z if scale_ub is not None: 2025-05-07T20:32:28.8755797Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:28.8756138Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:28.8756445Z ) 2025-05-07T20:32:28.8756646Z else: 2025-05-07T20:32:28.8756862Z scale_ub_tensor = None 2025-05-07T20:32:28.8757115Z 2025-05-07T20:32:28.8757347Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:28.8757663Z op = silu_mul_quant 2025-05-07T20:32:28.8757908Z if compiled: 2025-05-07T20:32:28.8758159Z op = torch.compile(op) 2025-05-07T20:32:28.8758459Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:28.8758732Z 2025-05-07T20:32:28.8758930Z y_fp8, y_scale = fn() 2025-05-07T20:32:28.8759215Z y = y_fp8.to(torch.float32) * y_scale[:, None] 2025-05-07T20:32:28.8759499Z 2025-05-07T20:32:28.8759737Z def ref_fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:28.8760079Z x0_fp32 = x0.to(torch.float32) 2025-05-07T20:32:28.8760383Z x1_fp32 = x1.to(torch.float32) 2025-05-07T20:32:28.8760787Z y = x0_fp32 * torch.sigmoid(x0_fp32) * x1_fp32 2025-05-07T20:32:28.8761148Z return triton_quantize_fp8_row(y, scale_ub_tensor) 2025-05-07T20:32:28.8761454Z 2025-05-07T20:32:28.8761663Z > y_fp8_ref, y_scale_ref = ref_fn() 2025-05-07T20:32:28.8761854Z 2025-05-07T20:32:28.8761965Z moe/activation_test.py:126: 2025-05-07T20:32:28.8762257Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:28.8762596Z moe/activation_test.py:124: in ref_fn 2025-05-07T20:32:28.8762922Z return triton_quantize_fp8_row(y, scale_ub_tensor) 2025-05-07T20:32:28.8763710Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/fbgemm_gpu/experimental/gemm/triton_gemm/fp8_gemm.py:2370: in triton_quantize_fp8_row 2025-05-07T20:32:28.8764446Z _kernel_quantize_fp8_row[grid]( 2025-05-07T20:32:28.8764997Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:28.8765675Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:28.8766352Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/autotuner.py:186: in run 2025-05-07T20:32:28.8767067Z timings = {config: self._bench(*args, config=config, **kwargs) for config in pruned_configs} 2025-05-07T20:32:28.8767787Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/autotuner.py:166: in _bench 2025-05-07T20:32:28.8768417Z return self.do_bench(kernel_call, quantiles=(0.5, 0.2, 0.8)) 2025-05-07T20:32:28.8769057Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/testing.py:117: in do_bench 2025-05-07T20:32:28.8769569Z fn() 2025-05-07T20:32:28.8770070Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/autotuner.py:152: in kernel_call 2025-05-07T20:32:28.8770645Z self.fn.run( 2025-05-07T20:32:28.8771103Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:28.8771624Z kernel = self.compile( 2025-05-07T20:32:28.8772204Z 
/home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:28.8772846Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:28.8773289Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:28.8773522Z 2025-05-07T20:32:28.8773918Z self = 2025-05-07T20:32:28.8774991Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=2, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:28.8776396Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f78c52aa700>} 2025-05-07T20:32:28.8777723Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:28.8778731Z context = 2025-05-07T20:32:28.8779013Z 2025-05-07T20:32:28.8779182Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:28.8779698Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:28.8780158Z module_map=module_map) 2025-05-07T20:32:28.8780523Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:28.8780877Z E def _kernel_quantize_fp8_row( 2025-05-07T20:32:28.8781142Z E ^ 2025-05-07T20:32:28.8781648Z E ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:28.8782087Z 2025-05-07T20:32:28.8782504Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:28.8783013Z 2025-05-07T20:32:28.8783124Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:28.8783526Z self=, 2025-05-07T20:32:28.8783927Z T=2048, 2025-05-07T20:32:28.8784121Z D=5120, 2025-05-07T20:32:28.8784309Z scale_ub=1200.0, 2025-05-07T20:32:28.8784537Z contiguous=True, 2025-05-07T20:32:28.8784758Z compiled=False, 2025-05-07T20:32:28.8784956Z ) 2025-05-07T20:32:28.8785277Z self = 2025-05-07T20:32:28.8785772Z T = 2048, D = 5120, scale_ub = 1200.0, contiguous = True, compiled = False 2025-05-07T20:32:28.8786043Z 2025-05-07T20:32:28.8786127Z @given( 2025-05-07T20:32:28.8786361Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:28.8786674Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:28.8786980Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:28.8787301Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:28.8787630Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:28.8787915Z ) 2025-05-07T20:32:28.8788257Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:28.8788698Z def test_silu_mul_quant( 2025-05-07T20:32:28.8788943Z self, 2025-05-07T20:32:28.8789184Z T: int, 2025-05-07T20:32:28.8789383Z D: int, 2025-05-07T20:32:28.8789601Z scale_ub: Optional[float], 2025-05-07T20:32:28.8789864Z contiguous: bool, 2025-05-07T20:32:28.8790105Z compiled: bool, 2025-05-07T20:32:28.8790324Z ) -> None: 2025-05-07T20:32:28.8790532Z torch.manual_seed(2025) 2025-05-07T20:32:28.8790776Z 2025-05-07T20:32:28.8791047Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:28.8791388Z 2025-05-07T20:32:28.8791627Z x_sign = torch.sign(x) 2025-05-07T20:32:28.8791913Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:28.8792219Z x = x_sign * x_clamp 2025-05-07T20:32:28.8792455Z x0 = x[:, :D] 
2025-05-07T20:32:28.8792672Z x1 = x[:, D:] 2025-05-07T20:32:28.8792880Z 2025-05-07T20:32:28.8793061Z if contiguous: 2025-05-07T20:32:28.8793293Z x0 = x0.contiguous() 2025-05-07T20:32:28.8793551Z x1 = x1.contiguous() 2025-05-07T20:32:28.8793787Z 2025-05-07T20:32:28.8793979Z if scale_ub is not None: 2025-05-07T20:32:28.8794255Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:28.8794585Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:28.8794946Z ) 2025-05-07T20:32:28.8795141Z else: 2025-05-07T20:32:28.8795349Z scale_ub_tensor = None 2025-05-07T20:32:28.8795605Z 2025-05-07T20:32:28.8795837Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:28.8796157Z op = silu_mul_quant 2025-05-07T20:32:28.8796400Z if compiled: 2025-05-07T20:32:28.8796645Z op = torch.compile(op) 2025-05-07T20:32:28.8796942Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:28.8797212Z 2025-05-07T20:32:28.8797407Z > y_fp8, y_scale = fn() 2025-05-07T20:32:28.8797569Z 2025-05-07T20:32:28.8797674Z moe/activation_test.py:117: 2025-05-07T20:32:28.8797963Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:28.8798666Z moe/activation_test.py:115: in fn 2025-05-07T20:32:28.8798951Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:28.8799633Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:32:28.8800443Z _fbgemm_silu_mul_quant[grid]( 2025-05-07T20:32:28.8800978Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:28.8801658Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:28.8802309Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:28.8802840Z kernel = self.compile( 2025-05-07T20:32:28.8803431Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:28.8804079Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:28.8804471Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:28.8804706Z 2025-05-07T20:32:28.8804917Z self = 2025-05-07T20:32:28.8805987Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:28.8807342Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f78c5162020>} 2025-05-07T20:32:28.8808659Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:28.8809741Z context = 2025-05-07T20:32:28.8810028Z 2025-05-07T20:32:28.8810192Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:28.8810714Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:28.8811171Z module_map=module_map) 2025-05-07T20:32:28.8811533Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:28.8811957Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:32:28.8812212Z E ^ 2025-05-07T20:32:28.8812666Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:28.8813108Z 2025-05-07T20:32:28.8813514Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:100: CompilationError

[Every remaining Hypothesis example in this run fails with this same CompilationError; each repeat is condensed below to its parameters and the kernel that failed to compile.]

Trying example: test_silu_mul_quant(self=, T=2048, D=5120, scale_ub=1200.0, contiguous=True, compiled=True)
    [same test source as above; ref_fn() at moe/activation_test.py:126 fails in triton_quantize_fp8_row compiling _kernel_quantize_fp8_row: ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')")]
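All of these compilation failures share one root cause: Triton's NVIDIA backend appears to accept the fp8e4nv (float8_e4m3fn) dtype only on GPUs with compute capability 8.9 or newer, and the A10G in a linux.g5.4xlarge runner is SM 8.6, where only 'fp8e4b15' and 'fp8e5' are offered. A minimal standalone sketch of a preflight check (the helper name is illustrative, not an FBGEMM API):

    import torch

    def fp8e4nv_supported() -> bool:
        # fp8e4nv needs SM 8.9+ (Ada/Hopper) in Triton's NVIDIA backend;
        # the A10G on this runner reports capability (8, 6).
        if not torch.cuda.is_available():
            return False
        return torch.cuda.get_device_capability() >= (8, 9)

    if __name__ == "__main__":
        print(torch.cuda.get_device_name(0), torch.cuda.get_device_capability(0))
        print("fp8e4nv supported:", fp8e4nv_supported())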
Trying example: test_silu_mul_quant(self=, T=16384, D=7168, scale_ub=1200.0, contiguous=False, compiled=False)
    [same test source; fn() at moe/activation_test.py:117 fails in silu_mul_quant compiling _fbgemm_silu_mul_quant with the identical CompilationError]
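If this job is expected to pass on pre-Ada runners, a capability-gated skip would turn these hard CompilationErrors into clean skips instead of retried failures. A hedged sketch (the helper and decorator placement are illustrative; FBGEMM may already ship its own gating utilities):

    import unittest
    import torch

    def _has_triton_fp8() -> bool:
        # SM 8.9 is the first architecture where Triton compiles fp8e4nv.
        return torch.cuda.is_available() and torch.cuda.get_device_capability() >= (8, 9)

    class ActivationTests(unittest.TestCase):
        @unittest.skipIf(
            not _has_triton_fp8(),
            "Triton fp8e4nv requires SM 8.9+; this GPU only offers fp8e4b15/fp8e5",
        )
        def test_silu_mul_quant(self) -> None:
            ...  # body unchanged from the source shown above

With the guard outermost, the skip would fire before any example is generated, so the retry wrapper would see a skip rather than a failure.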
Trying example: test_silu_mul_quant(self=, T=1, D=7168, scale_ub=None, contiguous=True, compiled=True)
    [same test source; ref_fn() at moe/activation_test.py:126 fails compiling _kernel_quantize_fp8_row with the identical CompilationError]

Trying example: test_silu_mul_quant(self=, T=4096, D=5120, scale_ub=None, contiguous=False, compiled=False)
    [same test source; fn() at moe/activation_test.py:117 fails compiling _fbgemm_silu_mul_quant with the identical CompilationError]

Trying example: test_silu_mul_quant(self=, T=4096, D=7168, scale_ub=None, contiguous=False, compiled=False)
    [same test source; fn() at moe/activation_test.py:117 fails compiling _fbgemm_silu_mul_quant with the identical CompilationError]
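For debugging on hardware where the Triton kernel cannot compile at all, the rowwise quantization that ref_fn() delegates to triton_quantize_fp8_row can be approximated in eager PyTorch, which converts to float8_e4m3fn in software. This is a sketch under assumptions, not fbgemm_gpu's implementation: it assumes y_scale is the dequantization scale (the test computes y_fp8.to(torch.float32) * y_scale[:, None]) and that scale_ub caps the per-row max before the scale is derived.

    from typing import Optional, Tuple

    import torch

    def quantize_fp8_row_eager(
        y: torch.Tensor, scale_ub: Optional[torch.Tensor] = None
    ) -> Tuple[torch.Tensor, torch.Tensor]:
        fp8_max = torch.finfo(torch.float8_e4m3fn).max  # 448.0
        row_max = y.abs().amax(dim=-1, keepdim=True).to(torch.float32)
        if scale_ub is not None:
            row_max = torch.minimum(row_max, scale_ub)  # assumed clamp direction
        row_max = row_max.clamp(min=1e-12)  # guard all-zero rows
        y_scale = row_max / fp8_max  # dequant scale, as consumed by the test
        y_fp8 = (y.to(torch.float32) / y_scale).clamp(-fp8_max, fp8_max)
        return y_fp8.to(torch.float8_e4m3fn), y_scale.squeeze(-1)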
_fbgemm_silu_mul_quant[grid]( 2025-05-07T20:32:30.9512716Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:30.9513487Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:30.9514151Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:30.9514681Z kernel = self.compile( 2025-05-07T20:32:30.9515221Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:30.9515875Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:30.9516276Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:30.9516503Z 2025-05-07T20:32:30.9516712Z self = 2025-05-07T20:32:30.9517782Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:30.9519191Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f78c411a840>} 2025-05-07T20:32:30.9520531Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:30.9521542Z context = 2025-05-07T20:32:30.9521827Z 2025-05-07T20:32:30.9521993Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:30.9522513Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:30.9522983Z module_map=module_map) 2025-05-07T20:32:30.9523346Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:30.9531283Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:32:30.9531545Z E ^ 2025-05-07T20:32:30.9532023Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:30.9532463Z 2025-05-07T20:32:30.9532871Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:30.9533376Z 2025-05-07T20:32:30.9533480Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:30.9533984Z self=, 2025-05-07T20:32:30.9534382Z T=128, 2025-05-07T20:32:30.9534562Z D=7168, 2025-05-07T20:32:30.9534841Z scale_ub=None, 2025-05-07T20:32:30.9535066Z contiguous=False, 2025-05-07T20:32:30.9535285Z compiled=True, 2025-05-07T20:32:30.9535491Z ) 2025-05-07T20:32:30.9535810Z self = 2025-05-07T20:32:30.9536291Z T = 128, D = 7168, scale_ub = None, contiguous = False, compiled = True 2025-05-07T20:32:30.9536560Z 2025-05-07T20:32:30.9536642Z @given( 2025-05-07T20:32:30.9536877Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:30.9537192Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:30.9537549Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:30.9537879Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:30.9538209Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:30.9538490Z ) 2025-05-07T20:32:30.9538842Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:30.9539284Z def test_silu_mul_quant( 2025-05-07T20:32:30.9539527Z self, 2025-05-07T20:32:30.9539726Z T: int, 2025-05-07T20:32:30.9539920Z D: int, 2025-05-07T20:32:30.9540132Z scale_ub: Optional[float], 2025-05-07T20:32:30.9540405Z contiguous: bool, 2025-05-07T20:32:30.9540646Z compiled: bool, 2025-05-07T20:32:30.9540913Z ) -> None: 2025-05-07T20:32:30.9541132Z torch.manual_seed(2025) 2025-05-07T20:32:30.9541379Z 2025-05-07T20:32:30.9541643Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:30.9541991Z 2025-05-07T20:32:30.9542187Z x_sign = torch.sign(x) 2025-05-07T20:32:30.9542474Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:30.9542773Z x = x_sign * x_clamp 2025-05-07T20:32:30.9543011Z x0 = x[:, :D] 2025-05-07T20:32:30.9543227Z x1 = x[:, D:] 2025-05-07T20:32:30.9543426Z 2025-05-07T20:32:30.9543612Z if contiguous: 2025-05-07T20:32:30.9543843Z x0 = x0.contiguous() 2025-05-07T20:32:30.9544092Z x1 = x1.contiguous() 2025-05-07T20:32:30.9544330Z 2025-05-07T20:32:30.9544519Z if scale_ub is not None: 2025-05-07T20:32:30.9544787Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:30.9545120Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:30.9545414Z ) 2025-05-07T20:32:30.9545655Z else: 2025-05-07T20:32:30.9545866Z scale_ub_tensor = None 2025-05-07T20:32:30.9546112Z 2025-05-07T20:32:30.9546346Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:30.9546658Z op = silu_mul_quant 2025-05-07T20:32:30.9546901Z if compiled: 2025-05-07T20:32:30.9547146Z op = torch.compile(op) 2025-05-07T20:32:30.9547439Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:30.9547707Z 2025-05-07T20:32:30.9547897Z y_fp8, y_scale = fn() 2025-05-07T20:32:30.9548179Z y = y_fp8.to(torch.float32) * y_scale[:, None] 2025-05-07T20:32:30.9548465Z 2025-05-07T20:32:30.9548699Z def ref_fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:30.9549029Z x0_fp32 = x0.to(torch.float32) 2025-05-07T20:32:30.9549316Z x1_fp32 = x1.to(torch.float32) 2025-05-07T20:32:30.9549622Z y = x0_fp32 * torch.sigmoid(x0_fp32) * x1_fp32 2025-05-07T20:32:30.9549977Z return triton_quantize_fp8_row(y, scale_ub_tensor) 2025-05-07T20:32:30.9550282Z 2025-05-07T20:32:30.9550473Z > y_fp8_ref, 
y_scale_ref = ref_fn() 2025-05-07T20:32:30.9550672Z 2025-05-07T20:32:30.9550769Z moe/activation_test.py:126: 2025-05-07T20:32:30.9551059Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:30.9551384Z moe/activation_test.py:124: in ref_fn 2025-05-07T20:32:30.9551700Z return triton_quantize_fp8_row(y, scale_ub_tensor) 2025-05-07T20:32:30.9552470Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/fbgemm_gpu/experimental/gemm/triton_gemm/fp8_gemm.py:2370: in triton_quantize_fp8_row 2025-05-07T20:32:30.9553262Z _kernel_quantize_fp8_row[grid]( 2025-05-07T20:32:30.9553845Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:30.9554518Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:30.9555189Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/autotuner.py:186: in run 2025-05-07T20:32:30.9555943Z timings = {config: self._bench(*args, config=config, **kwargs) for config in pruned_configs} 2025-05-07T20:32:30.9556648Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/autotuner.py:166: in _bench 2025-05-07T20:32:30.9557274Z return self.do_bench(kernel_call, quantiles=(0.5, 0.2, 0.8)) 2025-05-07T20:32:30.9557866Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/testing.py:117: in do_bench 2025-05-07T20:32:30.9558370Z fn() 2025-05-07T20:32:30.9558867Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/autotuner.py:152: in kernel_call 2025-05-07T20:32:30.9559436Z self.fn.run( 2025-05-07T20:32:30.9559938Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:30.9560452Z kernel = self.compile( 2025-05-07T20:32:30.9560985Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:30.9561625Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:30.9562012Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:30.9562243Z 2025-05-07T20:32:30.9562448Z self = 2025-05-07T20:32:30.9563512Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=2, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:30.9564911Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f78bf7bf060>} 2025-05-07T20:32:30.9566282Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:30.9567282Z context = 2025-05-07T20:32:30.9567571Z 2025-05-07T20:32:30.9567735Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:30.9568253Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:30.9568715Z module_map=module_map) 2025-05-07T20:32:30.9569074Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:30.9569429Z E def _kernel_quantize_fp8_row( 2025-05-07T20:32:30.9569693Z E ^ 2025-05-07T20:32:30.9570140Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:30.9570588Z 2025-05-07T20:32:30.9570997Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:31.1958701Z 2025-05-07T20:32:31.1958944Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:31.1959389Z self=, 2025-05-07T20:32:31.1959818Z T=128, 2025-05-07T20:32:31.1960010Z D=7168, 2025-05-07T20:32:31.1960244Z scale_ub=None, 2025-05-07T20:32:31.1960458Z contiguous=False, 2025-05-07T20:32:31.1960688Z compiled=False, 2025-05-07T20:32:31.1961161Z ) 2025-05-07T20:32:31.1961490Z self = 2025-05-07T20:32:31.1961983Z T = 128, D = 7168, scale_ub = None, contiguous = False, compiled = False 2025-05-07T20:32:31.1962246Z 2025-05-07T20:32:31.1962326Z @given( 2025-05-07T20:32:31.1962559Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:31.1962877Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:31.1963172Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:31.1963498Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:31.1963942Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:31.1964239Z ) 2025-05-07T20:32:31.1964586Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:31.1965024Z def test_silu_mul_quant( 2025-05-07T20:32:31.1965265Z self, 2025-05-07T20:32:31.1965450Z T: int, 2025-05-07T20:32:31.1965640Z D: int, 2025-05-07T20:32:31.1965855Z scale_ub: Optional[float], 2025-05-07T20:32:31.1966115Z contiguous: bool, 2025-05-07T20:32:31.1966350Z compiled: bool, 2025-05-07T20:32:31.1966570Z ) -> None: 2025-05-07T20:32:31.1966776Z torch.manual_seed(2025) 2025-05-07T20:32:31.1967055Z 2025-05-07T20:32:31.1967405Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:31.1967748Z 2025-05-07T20:32:31.1967939Z x_sign = torch.sign(x) 2025-05-07T20:32:31.1968225Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:31.1968534Z x = x_sign * x_clamp 2025-05-07T20:32:31.1968771Z x0 = x[:, :D] 2025-05-07T20:32:31.1968978Z x1 = x[:, D:] 2025-05-07T20:32:31.1969186Z 2025-05-07T20:32:31.1969374Z if contiguous: 2025-05-07T20:32:31.1969598Z x0 = x0.contiguous() 2025-05-07T20:32:31.1969854Z x1 = x1.contiguous() 2025-05-07T20:32:31.1970096Z 2025-05-07T20:32:31.1970280Z if scale_ub is not None: 2025-05-07T20:32:31.1970555Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:31.1970886Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:31.1971189Z ) 2025-05-07T20:32:31.1971382Z else: 2025-05-07T20:32:31.1971594Z scale_ub_tensor = None 2025-05-07T20:32:31.1971838Z 2025-05-07T20:32:31.1972150Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:31.1972464Z op = silu_mul_quant 2025-05-07T20:32:31.1972715Z if compiled: 2025-05-07T20:32:31.1972957Z op = torch.compile(op) 2025-05-07T20:32:31.1973249Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:31.1973519Z 2025-05-07T20:32:31.1973790Z > y_fp8, y_scale = fn() 2025-05-07T20:32:31.1973957Z 2025-05-07T20:32:31.1974054Z moe/activation_test.py:117: 2025-05-07T20:32:31.1974350Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:31.1974676Z moe/activation_test.py:115: in fn 2025-05-07T20:32:31.1974954Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:31.1975636Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:32:31.1976322Z 
_fbgemm_silu_mul_quant[grid]( 2025-05-07T20:32:31.1976851Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:31.1977525Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:31.1978181Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:31.1978701Z kernel = self.compile( 2025-05-07T20:32:31.1979240Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:31.1979887Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:31.1980332Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:31.1980558Z 2025-05-07T20:32:31.1980761Z self = 2025-05-07T20:32:31.1981832Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:31.1983222Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f78bf909e40>} 2025-05-07T20:32:31.1984543Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:31.1985544Z context = 2025-05-07T20:32:31.1985826Z 2025-05-07T20:32:31.1985989Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:31.1986507Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:31.1987010Z module_map=module_map) 2025-05-07T20:32:31.1987373Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:31.1987722Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:32:31.1987983Z E ^ 2025-05-07T20:32:31.1988437Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:31.1988874Z 2025-05-07T20:32:31.1989279Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:31.1989788Z 2025-05-07T20:32:31.1989892Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:31.1990303Z self=, 2025-05-07T20:32:31.1990702Z T=4096, 2025-05-07T20:32:31.1990897Z D=5120, 2025-05-07T20:32:31.1991084Z scale_ub=1200.0, 2025-05-07T20:32:31.1991305Z contiguous=True, 2025-05-07T20:32:31.1991527Z compiled=False, 2025-05-07T20:32:31.1991731Z ) 2025-05-07T20:32:31.1992090Z self = 2025-05-07T20:32:31.1992580Z T = 4096, D = 5120, scale_ub = 1200.0, contiguous = True, compiled = False 2025-05-07T20:32:31.1992853Z 2025-05-07T20:32:31.1992938Z @given( 2025-05-07T20:32:31.1993161Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:31.1993472Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:31.1993812Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:31.1994146Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:31.1994471Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:31.1994759Z ) 2025-05-07T20:32:31.1995098Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:31.1995536Z def test_silu_mul_quant( 2025-05-07T20:32:31.1995778Z self, 2025-05-07T20:32:31.1995963Z T: int, 2025-05-07T20:32:31.1996160Z D: int, 2025-05-07T20:32:31.1996378Z scale_ub: Optional[float], 2025-05-07T20:32:31.1996641Z contiguous: bool, 2025-05-07T20:32:31.1996875Z compiled: bool, 2025-05-07T20:32:31.1997097Z ) -> None: 2025-05-07T20:32:31.1997307Z torch.manual_seed(2025) 2025-05-07T20:32:31.1997537Z 2025-05-07T20:32:31.1997801Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:31.1998137Z 2025-05-07T20:32:31.1998488Z x_sign = torch.sign(x) 2025-05-07T20:32:31.1998782Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:31.1999093Z x = x_sign * x_clamp 2025-05-07T20:32:31.1999325Z x0 = x[:, :D] 2025-05-07T20:32:31.1999615Z x1 = x[:, D:] 2025-05-07T20:32:31.1999825Z 2025-05-07T20:32:31.2000007Z if contiguous: 2025-05-07T20:32:31.2000239Z x0 = x0.contiguous() 2025-05-07T20:32:31.2000497Z x1 = x1.contiguous() 2025-05-07T20:32:31.2000730Z 2025-05-07T20:32:31.2000926Z if scale_ub is not None: 2025-05-07T20:32:31.2001202Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:31.2001529Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:31.2001904Z ) 2025-05-07T20:32:31.2002102Z else: 2025-05-07T20:32:31.2002313Z scale_ub_tensor = None 2025-05-07T20:32:31.2002561Z 2025-05-07T20:32:31.2002795Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:31.2003110Z op = silu_mul_quant 2025-05-07T20:32:31.2003356Z if compiled: 2025-05-07T20:32:31.2003609Z op = torch.compile(op) 2025-05-07T20:32:31.2003954Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:31.2004233Z 2025-05-07T20:32:31.2004426Z > y_fp8, y_scale = fn() 2025-05-07T20:32:31.2004589Z 2025-05-07T20:32:31.2004696Z moe/activation_test.py:117: 2025-05-07T20:32:31.2005051Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:31.2005384Z moe/activation_test.py:115: in fn 2025-05-07T20:32:31.2005666Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:31.2006350Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:32:31.2007033Z 
2025-05-07T20:32:31.2007566Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/jit.py:330: in <lambda>
2025-05-07T20:32:31.2008242Z     return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs)
2025-05-07T20:32:31.2008893Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/jit.py:623: in run
2025-05-07T20:32:31.2009427Z     kernel = self.compile(
2025-05-07T20:32:31.2009967Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:273: in compile
2025-05-07T20:32:31.2010619Z     module = src.make_ir(options, codegen_fns, module_map, context)
2025-05-07T20:32:31.2011072Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 
2025-05-07T20:32:31.2011309Z 
2025-05-07T20:32:31.2011514Z self = 
2025-05-07T20:32:31.2012588Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True)
2025-05-07T20:32:31.2014001Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f78bf90a5c0>}
2025-05-07T20:32:31.2015324Z module_map = {'triton.language.extra.libdevice': }
2025-05-07T20:32:31.2016335Z context = 
2025-05-07T20:32:31.2016624Z 
2025-05-07T20:32:31.2016790Z     def make_ir(self, options, codegen_fns, module_map, context):
2025-05-07T20:32:31.2017306Z >       return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns,
2025-05-07T20:32:31.2017763Z                            module_map=module_map)
2025-05-07T20:32:31.2018126Z E       triton.compiler.errors.CompilationError: at 1:0:
2025-05-07T20:32:31.2018479Z E       def _fbgemm_silu_mul_quant(
2025-05-07T20:32:31.2018740Z E       ^
2025-05-07T20:32:31.2019191Z E       ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')")
2025-05-07T20:32:31.2019686Z 
2025-05-07T20:32:31.2020095Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:100: CompilationError
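[annotation] Every failure in this run has the same root cause: Triton's fp8e4nv corresponds to torch.float8_e4m3fn, which NVIDIA hardware supports natively only from compute capability 8.9 (Ada/Hopper) onward, while this runner's g5.4xlarge carries an A10G at SM 8.6, so the kernel is rejected at compile time. A minimal guard sketch follows; the helper name has_fp8e4nv_support is illustrative, and the class name ActivationTest is a stand-in inferred from the file name moe/activation_test.py, not FBGEMM's actual skip logic.

import unittest

import torch

def has_fp8e4nv_support() -> bool:
    # fp8e4nv (torch.float8_e4m3fn) requires an NVIDIA GPU with compute
    # capability >= 8.9 (Ada / Hopper); the A10G on this runner is 8.6.
    if not torch.cuda.is_available():
        return False
    return torch.cuda.get_device_capability() >= (8, 9)

@unittest.skipIf(not has_fp8e4nv_support(), "fp8e4nv requires SM 8.9+ (e.g. L4, H100)")
class ActivationTest(unittest.TestCase):  # class name assumed from the log
    ...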
2025-05-07T20:32:31.2020599Z 
2025-05-07T20:32:31.2020711Z Trying example: test_silu_mul_quant(
2025-05-07T20:32:31.2021124Z     self=,
2025-05-07T20:32:31.2021521Z     T=1,
2025-05-07T20:32:31.2021706Z     D=5120,
2025-05-07T20:32:31.2021900Z     scale_ub=None,
2025-05-07T20:32:31.2022153Z     contiguous=True,
2025-05-07T20:32:31.2022377Z     compiled=True,
2025-05-07T20:32:31.2022578Z )
2025-05-07T20:32:31.2022889Z self = 
2025-05-07T20:32:31.2023364Z T = 1, D = 5120, scale_ub = None, contiguous = True, compiled = True
[test body elided -- identical to the listing above through the definition of fn()]
2025-05-07T20:32:31.2035061Z         y_fp8, y_scale = fn()
2025-05-07T20:32:31.2035345Z         y = y_fp8.to(torch.float32) * y_scale[:, None]
2025-05-07T20:32:31.2035639Z 
2025-05-07T20:32:31.2035873Z         def ref_fn() -> Tuple[torch.Tensor, torch.Tensor]:
2025-05-07T20:32:31.2036214Z             x0_fp32 = x0.to(torch.float32)
2025-05-07T20:32:31.2036504Z             x1_fp32 = x1.to(torch.float32)
2025-05-07T20:32:31.2036814Z             y = x0_fp32 * torch.sigmoid(x0_fp32) * x1_fp32
2025-05-07T20:32:31.2037176Z             return triton_quantize_fp8_row(y, scale_ub_tensor)
2025-05-07T20:32:31.2037541Z 
2025-05-07T20:32:31.2037741Z >       y_fp8_ref, y_scale_ref = ref_fn()
2025-05-07T20:32:31.2037939Z 
2025-05-07T20:32:31.2038040Z moe/activation_test.py:126: 
2025-05-07T20:32:31.2038336Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 
2025-05-07T20:32:31.2038673Z moe/activation_test.py:124: in ref_fn
2025-05-07T20:32:31.2038999Z     return triton_quantize_fp8_row(y, scale_ub_tensor)
2025-05-07T20:32:31.2039778Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/fbgemm_gpu/experimental/gemm/triton_gemm/fp8_gemm.py:2370: in triton_quantize_fp8_row
2025-05-07T20:32:31.2040568Z     _kernel_quantize_fp8_row[grid](
2025-05-07T20:32:31.2041107Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/jit.py:330: in <lambda>
2025-05-07T20:32:31.2041782Z     return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs)
2025-05-07T20:32:31.2042465Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/autotuner.py:186: in run
2025-05-07T20:32:31.2043182Z     timings = {config: self._bench(*args, config=config, **kwargs) for config in pruned_configs}
2025-05-07T20:32:31.2043986Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/autotuner.py:166: in _bench
2025-05-07T20:32:31.2044620Z     return self.do_bench(kernel_call, quantiles=(0.5, 0.2, 0.8))
2025-05-07T20:32:31.2045221Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/testing.py:117: in do_bench
2025-05-07T20:32:31.2045740Z     fn()
2025-05-07T20:32:31.2046238Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/autotuner.py:152: in kernel_call
2025-05-07T20:32:31.2046818Z     self.fn.run(
2025-05-07T20:32:31.2047288Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/jit.py:623: in run
2025-05-07T20:32:31.2047808Z     kernel = self.compile(
2025-05-07T20:32:31.2048346Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:273: in compile
2025-05-07T20:32:31.2048993Z     module = src.make_ir(options, codegen_fns, module_map, context)
2025-05-07T20:32:31.2049393Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 
2025-05-07T20:32:31.2049623Z 
2025-05-07T20:32:31.2049890Z self = 
2025-05-07T20:32:31.2050957Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=2, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True)
2025-05-07T20:32:31.2052308Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f78bf90b240>}
2025-05-07T20:32:31.2053634Z module_map = {'triton.language.extra.libdevice': }
2025-05-07T20:32:31.2054726Z context = 
2025-05-07T20:32:31.2055017Z 
2025-05-07T20:32:31.2055187Z     def make_ir(self, options, codegen_fns, module_map, context):
2025-05-07T20:32:31.2055706Z >       return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns,
2025-05-07T20:32:31.2056171Z                            module_map=module_map)
2025-05-07T20:32:31.2056531Z E       triton.compiler.errors.CompilationError: at 1:0:
2025-05-07T20:32:31.2056889Z E       def _kernel_quantize_fp8_row(
2025-05-07T20:32:31.2057161Z E       ^
2025-05-07T20:32:31.2057610Z E       ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')")
2025-05-07T20:32:31.2058056Z 
2025-05-07T20:32:31.2058464Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:100: CompilationError
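[annotation] The reference path above spells out the op's semantics: SiLU(x0) * x1 in fp32, followed by row-wise fp8 quantization. Below is a compact eager-mode sketch of that contract, matching how the test dequantizes (y_fp8.to(torch.float32) * y_scale[:, None]). It assumes scale_ub acts as a cap on the per-row maximum; the function name is illustrative and this is not FBGEMM's actual triton_quantize_fp8_row implementation.

from typing import Optional, Tuple

import torch

def silu_mul_quant_reference(
    x0: torch.Tensor,                          # [T, D] bf16
    x1: torch.Tensor,                          # [T, D] bf16
    scale_ub: Optional[torch.Tensor] = None,   # optional [1] fp32 cap on row max
) -> Tuple[torch.Tensor, torch.Tensor]:
    x0_fp32 = x0.to(torch.float32)
    y = x0_fp32 * torch.sigmoid(x0_fp32) * x1.to(torch.float32)  # SiLU(x0) * x1

    fp8_max = torch.finfo(torch.float8_e4m3fn).max  # 448.0 for e4m3fn
    row_max = y.abs().amax(dim=1)
    if scale_ub is not None:
        row_max = torch.minimum(row_max, scale_ub)  # assumed: cap outlier rows
    row_max = torch.clamp(row_max, min=1e-12)       # guard all-zero rows
    y_scale = row_max / fp8_max                     # per-row dequantization scale
    y_fp8 = torch.clamp(y / y_scale[:, None], -fp8_max, fp8_max).to(torch.float8_e4m3fn)
    return y_fp8, y_scale  # y is recovered as y_fp8.float() * y_scale[:, None]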
2025-05-07T20:32:31.9062967Z 
2025-05-07T20:32:31.9063637Z Trying example: test_silu_mul_quant(self=, T=2048, D=5120, scale_ub=None, contiguous=True, compiled=True)
2025-05-07T20:32:31.9110886Z Trying example: test_silu_mul_quant(self=, T=128, D=5120, scale_ub=None, contiguous=True, compiled=True)
2025-05-07T20:32:32.6993427Z Trying example: test_silu_mul_quant(self=, T=4096, D=5120, scale_ub=None, contiguous=True, compiled=True)
2025-05-07T20:32:32.7033060Z Trying example: test_silu_mul_quant(self=, T=16384, D=5120, scale_ub=None, contiguous=True, compiled=True)
[each of these four examples fails exactly like the T=1 example above: ref_fn() at moe/activation_test.py:126 -> triton_quantize_fp8_row -> _kernel_quantize_fp8_row, CompilationError: ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')")]
2025-05-07T20:32:32.7275691Z W0507 20:32:32.726000 276483 site-packages/torch/_dynamo/convert_frame.py:987] [0/8] torch._dynamo hit config.recompile_limit (8)
2025-05-07T20:32:32.7276916Z W0507 20:32:32.726000 276483 site-packages/torch/_dynamo/convert_frame.py:987] [0/8] function: 'silu_mul_quant' (/home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:55)
2025-05-07T20:32:32.7278223Z W0507 20:32:32.726000 276483 site-packages/torch/_dynamo/convert_frame.py:987] [0/8] last reason: 0/7: tensor 'x0' stride mismatch at index 0. expected 5120, actual 10240
2025-05-07T20:32:32.7279312Z W0507 20:32:32.726000 276483 site-packages/torch/_dynamo/convert_frame.py:987] [0/8] To log all recompilation reasons, use TORCH_LOGS="recompiles".
2025-05-07T20:32:32.7280406Z W0507 20:32:32.726000 276483 site-packages/torch/_dynamo/convert_frame.py:987] [0/8] To diagnose recompilation issues, see https://pytorch.org/docs/main/torch.compiler_troubleshooting.html.
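[annotation] The warning above is a secondary issue: every hypothesis example changes T, and the contiguous flag flips x0 between a contiguous tensor (row stride 5120) and a sliced view (row stride 10240), so each compiled call guards on a new shape/stride combination until the recompile budget of 8 is exhausted and dynamo falls back. A sketch of the usual mitigations, assuming this build's knob name (the warning itself prints config.recompile_limit; older releases call it cache_size_limit) and taking the import path from the traceback:

import torch
from fbgemm_gpu.experimental.gen_ai.moe.activation import silu_mul_quant

# Raise the per-function recompile budget (knob name is version-dependent:
# recompile_limit here, cache_size_limit in older PyTorch releases).
torch._dynamo.config.recompile_limit = 64

# Compile once with dynamic shapes so a new T does not bake a new graph.
compiled_op = torch.compile(silu_mul_quant, dynamic=True)

def run(x0: torch.Tensor, x1: torch.Tensor, scale_ub: torch.Tensor | None):
    # Alternatively, mark only the token dimension dynamic; note the
    # contiguous-vs-sliced stride guard still costs one recompile per layout.
    torch._dynamo.mark_dynamic(x0, 0)
    torch._dynamo.mark_dynamic(x1, 0)
    return compiled_op(x0, x1, scale_ub)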
2025-05-07T20:32:33.1306690Z 
2025-05-07T20:32:33.1306873Z Trying example: test_silu_mul_quant(self=, T=1, D=5120, scale_ub=1200.0, contiguous=True, compiled=True)
2025-05-07T20:32:33.1309854Z T = 1, D = 5120, scale_ub = 1200.0, contiguous = True, compiled = True
[test body elided -- identical to the listing above]
2025-05-07T20:32:33.1321489Z >       y_fp8, y_scale = fn()
2025-05-07T20:32:33.1321748Z moe/activation_test.py:117: 
2025-05-07T20:32:33.1322038Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 
2025-05-07T20:32:33.1322367Z moe/activation_test.py:115: in fn
2025-05-07T20:32:33.1322638Z     return op(x0, x1, scale_ub_tensor)
2025-05-07T20:32:33.1323256Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/torch/_dynamo/eval_frame.py:678: in _fn
2025-05-07T20:32:33.1323809Z     return fn(*args, **kwargs)
2025-05-07T20:32:33.1324464Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant
2025-05-07T20:32:33.1325139Z     _fbgemm_silu_mul_quant[grid](
[remaining triton compile frames elided -- identical to the first traceback above]
2025-05-07T20:32:33.1336523Z E       triton.compiler.errors.CompilationError: at 1:0:
2025-05-07T20:32:33.1336874Z E       def _fbgemm_silu_mul_quant(
2025-05-07T20:32:33.1337135Z E       ^
2025-05-07T20:32:33.1337586Z E       ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')")
2025-05-07T20:32:33.1338439Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:100: CompilationError
2025-05-07T20:32:33.1339050Z Trying example: test_silu_mul_quant(self=, T=1, D=5120, scale_ub=None, contiguous=False, compiled=True)
[fails exactly like the earlier T=1 example: ref_fn() -> triton_quantize_fp8_row -> _kernel_quantize_fp8_row, CompilationError: ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')")]
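[annotation] Once a fix or a supported GPU is in place, any one of these failing parameter sets can be pinned so it always runs first; hypothesis replays pinned examples deterministically on every invocation. A sketch reusing the strategies from the listing above, shown as a free function (in the real test class the method keeps its self parameter, verbosity, and max_examples settings):

from typing import Optional

from hypothesis import example, given, settings, strategies as st

@given(
    T=st.sampled_from([1, 128, 2048, 4096, 16384]),
    D=st.sampled_from([5120, 7168]),
    scale_ub=st.sampled_from([None, 1200.00]),
    contiguous=st.sampled_from([True, False]),
    compiled=st.sampled_from([True, False]),
)
@example(T=1, D=5120, scale_ub=1200.0, contiguous=True, compiled=True)  # pinned repro
@settings(deadline=None)
def test_silu_mul_quant(
    T: int, D: int, scale_ub: Optional[float], contiguous: bool, compiled: bool
) -> None:
    ...  # body as in the listing above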
2025-05-07T20:32:33.2772799Z 
2025-05-07T20:32:33.2773146Z Trying example: test_silu_mul_quant(self=, T=1, D=5120, scale_ub=None, contiguous=True, compiled=False)
[fails exactly like the first example above: fn() at moe/activation_test.py:117 -> silu_mul_quant -> _fbgemm_silu_mul_quant, CompilationError: ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')")]
2025-05-07T20:32:33.2804942Z Trying example: test_silu_mul_quant(self=, T=128, D=5120, scale_ub=None, contiguous=False, compiled=True)
[fails the same way through the torch.compile wrapper (torch/_dynamo/eval_frame.py:678: in _fn), ending in the same CompilationError in _fbgemm_silu_mul_quant]
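[annotation] The error message itself lists fp8e5 (e5m2) as a supported fp8 dtype on this architecture, alongside fp8e4b15. A kernel or test that must also run on pre-Ada GPUs could therefore fall back to torch.float8_e5m2 at the cost of mantissa precision; a hedged sketch of such a selection (the function name is illustrative, and FBGEMM's real kernels target e4m3):

import torch

def pick_fp8_dtype() -> torch.dtype:
    # e4m3 (fp8e4nv) needs SM 8.9+; per the Triton error above, e5m2 (fp8e5)
    # is accepted on this SM 8.6 device.
    if torch.cuda.get_device_capability() >= (8, 9):
        return torch.float8_e4m3fn
    return torch.float8_e5m2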
2025-05-07T20:32:33.2821984Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:32:33.2822655Z _fbgemm_silu_mul_quant[grid]( 2025-05-07T20:32:33.2823177Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:33.2823894Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:33.2824554Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:33.2825082Z kernel = self.compile( 2025-05-07T20:32:33.2825610Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:33.2826246Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:33.2826636Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:33.2826858Z 2025-05-07T20:32:33.2827066Z self = 2025-05-07T20:32:33.2828121Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:33.2829497Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f779dd03ce0>} 2025-05-07T20:32:33.2830812Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:33.2831808Z context = 2025-05-07T20:32:33.2832084Z 2025-05-07T20:32:33.2832252Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:33.2832750Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:33.2833212Z module_map=module_map) 2025-05-07T20:32:33.2833567Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:33.2833910Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:32:33.2834159Z E ^ 2025-05-07T20:32:33.2834609Z E ValueError("type fp8e4nv not supported in this architecture. 
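This ValueError is the whole failure mode of the job: Triton rejects the fp8e4nv (float8 e4m3) element type at kernel-compile time because the g5 runner's NVIDIA A10G reports compute capability 8.6, while Triton's fp8e4nv support starts at compute capability 8.9 (Ada/Hopper). A minimal sketch of a capability gate that would skip these cases on such runners; the helper name is illustrative, not from FBGEMM, and it assumes CUDA device 0 is the test device:

    import unittest

    import torch

    def device_supports_fp8e4nv() -> bool:
        # fp8e4nv corresponds to torch.float8_e4m3fn and needs SM 8.9+;
        # an A10G (SM 8.6) takes the failing path seen in this log.
        if not torch.cuda.is_available():
            return False
        return torch.cuda.get_device_capability(0) >= (8, 9)

    # Usage sketch: gate the fp8 tests rather than letting every Hypothesis
    # example fail with the same CompilationError.
    # @unittest.skipUnless(device_supports_fp8e4nv(), "fp8e4nv requires SM 8.9+")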
2025-05-07T20:32:33.2836065Z Trying example: test_silu_mul_quant(T=128, D=7168, scale_ub=1200.0, contiguous=False, compiled=False)
    -> same CompilationError from _fbgemm_silu_mul_quant: fp8e4nv not supported
2025-05-07T20:32:33.4436347Z Trying example: test_silu_mul_quant(T=128, D=5120, scale_ub=None, contiguous=False, compiled=False)
    -> same CompilationError from _fbgemm_silu_mul_quant: fp8e4nv not supported
2025-05-07T20:32:33.4467689Z Trying example: test_silu_mul_quant(T=128, D=5120, scale_ub=1200.0, contiguous=True, compiled=False)
    -> same CompilationError from _fbgemm_silu_mul_quant: fp8e4nv not supported
2025-05-07T20:32:33.6038709Z Trying example: test_silu_mul_quant(T=1, D=7168, scale_ub=1200.0, contiguous=True, compiled=True)
    -> same CompilationError from _fbgemm_silu_mul_quant: fp8e4nv not supported
2025-05-07T20:32:33.6076010Z Trying example: test_silu_mul_quant(T=1, D=7168, scale_ub=1200.0, contiguous=False, compiled=True)
    -> same CompilationError from _fbgemm_silu_mul_quant: fp8e4nv not supported
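The "Trying example" lines above come from Hypothesis: the test runs with verbosity=Verbosity.verbose, so every drawn parameter combination is logged before it fails. To replay one combination from this log deterministically, Hypothesis's @example decorator can pin it; a standalone sketch, not the FBGEMM test file, with the strategy lists shortened:

    from hypothesis import Verbosity, example, given, settings
    from hypothesis import strategies as st

    @given(T=st.sampled_from([1, 128]), D=st.sampled_from([5120, 7168]))
    @example(T=1, D=7168)  # pinned case from the log; explicit examples run first
    @settings(verbosity=Verbosity.verbose, max_examples=10, deadline=None)
    def test_sketch(T: int, D: int) -> None:
        assert T * D > 0  # placeholder body; the real test calls silu_mul_quant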
2025-05-07T20:32:33.8153222Z Trying example: test_silu_mul_quant(
    T=1,
    D=7168,
    scale_ub=None,
    contiguous=False,
    compiled=True,
)
T = 1, D = 7168, scale_ub = None, contiguous = False, compiled = True

This example got further: fn() returned, and the same error was raised from the reference path instead:

        y_fp8, y_scale = fn()
        y = y_fp8.to(torch.float32) * y_scale[:, None]

        def ref_fn() -> Tuple[torch.Tensor, torch.Tensor]:
            x0_fp32 = x0.to(torch.float32)
            x1_fp32 = x1.to(torch.float32)
            y = x0_fp32 * torch.sigmoid(x0_fp32) * x1_fp32
            return triton_quantize_fp8_row(y, scale_ub_tensor)

>       y_fp8_ref, y_scale_ref = ref_fn()

moe/activation_test.py:126:
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _
moe/activation_test.py:124: in ref_fn
    return triton_quantize_fp8_row(y, scale_ub_tensor)
/home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/fbgemm_gpu/experimental/gemm/triton_gemm/fp8_gemm.py:2370: in triton_quantize_fp8_row
    _kernel_quantize_fp8_row[grid](
/home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/jit.py:330: in <lambda>
    return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs)
/home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/autotuner.py:186: in run
    timings = {config: self._bench(*args, config=config, **kwargs) for config in pruned_configs}
/home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/autotuner.py:166: in _bench
    return self.do_bench(kernel_call, quantiles=(0.5, 0.2, 0.8))
/home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/testing.py:117: in do_bench
    fn()
/home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/autotuner.py:152: in kernel_call
    self.fn.run(
/home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/jit.py:623: in run
    kernel = self.compile(
/home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:273: in compile
    module = src.make_ir(options, codegen_fns, module_map, context)
E   triton.compiler.errors.CompilationError: at 1:0:
E   def _kernel_quantize_fp8_row(
E   ^
E   ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')")

/home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:100: CompilationError
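For reference, the test dequantizes with y = y_fp8.to(torch.float32) * y_scale[:, None], i.e. one scale per row. A hedged sketch of row-wise fp8 quantization consistent with that dequant step, on one plausible reading of scale_ub as a cap on the per-row max; names are illustrative and this is not FBGEMM's _kernel_quantize_fp8_row:

    import torch

    FP8_MAX = torch.finfo(torch.float8_e4m3fn).max  # 448.0 for e4m3fn

    def quantize_fp8_row_sketch(
        y: torch.Tensor, scale_ub: torch.Tensor | None = None
    ) -> tuple[torch.Tensor, torch.Tensor]:
        # Per-row max magnitude, optionally capped, sets the dequant scale.
        row_max = y.abs().amax(dim=1).clamp_min(1e-12)
        if scale_ub is not None:
            row_max = torch.minimum(row_max, scale_ub)
        y_scale = row_max / FP8_MAX
        y_fp8 = (y / y_scale[:, None]).clamp(-FP8_MAX, FP8_MAX).to(torch.float8_e4m3fn)
        return y_fp8, y_scale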
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:33.8193274Z 2025-05-07T20:32:33.8193680Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:33.8194192Z 2025-05-07T20:32:33.8194295Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:33.8194703Z self=, 2025-05-07T20:32:33.8195098Z T=1, 2025-05-07T20:32:33.8195275Z D=5120, 2025-05-07T20:32:33.8195468Z scale_ub=1200.0, 2025-05-07T20:32:33.8195690Z contiguous=False, 2025-05-07T20:32:33.8195950Z compiled=True, 2025-05-07T20:32:33.8196156Z ) 2025-05-07T20:32:33.8196470Z self = 2025-05-07T20:32:33.8196946Z T = 1, D = 5120, scale_ub = 1200.0, contiguous = False, compiled = True 2025-05-07T20:32:33.8197213Z 2025-05-07T20:32:33.8197295Z @given( 2025-05-07T20:32:33.8197526Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:33.8197839Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:33.8198141Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:33.8198801Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:33.8199125Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:33.8199405Z ) 2025-05-07T20:32:33.8199755Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:33.8200189Z def test_silu_mul_quant( 2025-05-07T20:32:33.8200422Z self, 2025-05-07T20:32:33.8200616Z T: int, 2025-05-07T20:32:33.8200814Z D: int, 2025-05-07T20:32:33.8201024Z scale_ub: Optional[float], 2025-05-07T20:32:33.8201290Z contiguous: bool, 2025-05-07T20:32:33.8201526Z compiled: bool, 2025-05-07T20:32:33.8201740Z ) -> None: 2025-05-07T20:32:33.8201951Z torch.manual_seed(2025) 2025-05-07T20:32:33.8202265Z 2025-05-07T20:32:33.8202530Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:33.8202870Z 2025-05-07T20:32:33.8203061Z x_sign = torch.sign(x) 2025-05-07T20:32:33.8203353Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:33.8203656Z x = x_sign * x_clamp 2025-05-07T20:32:33.8203891Z x0 = x[:, :D] 2025-05-07T20:32:33.8204104Z x1 = x[:, D:] 2025-05-07T20:32:33.8204305Z 2025-05-07T20:32:33.8204489Z if contiguous: 2025-05-07T20:32:33.8204718Z x0 = x0.contiguous() 2025-05-07T20:32:33.8204969Z x1 = x1.contiguous() 2025-05-07T20:32:33.8205210Z 2025-05-07T20:32:33.8205403Z if scale_ub is not None: 2025-05-07T20:32:33.8205665Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:33.8205999Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:33.8206305Z ) 2025-05-07T20:32:33.8206492Z else: 2025-05-07T20:32:33.8206711Z scale_ub_tensor = None 2025-05-07T20:32:33.8207026Z 2025-05-07T20:32:33.8207252Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:33.8207566Z op = silu_mul_quant 2025-05-07T20:32:33.8207835Z if compiled: 2025-05-07T20:32:33.8208082Z op = torch.compile(op) 2025-05-07T20:32:33.8208377Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:33.8208644Z 2025-05-07T20:32:33.8208838Z > y_fp8, y_scale = fn() 2025-05-07T20:32:33.8209010Z 2025-05-07T20:32:33.8209109Z moe/activation_test.py:117: 2025-05-07T20:32:33.8209401Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:33.8209725Z moe/activation_test.py:115: in fn 2025-05-07T20:32:33.8210009Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:33.8210560Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/torch/_dynamo/eval_frame.py:678: in _fn 2025-05-07T20:32:33.8211116Z return fn(*args, **kwargs) 
2025-05-07T20:32:33.8211769Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:32:33.8212452Z _fbgemm_silu_mul_quant[grid]( 2025-05-07T20:32:33.8212985Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:33.8213646Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:33.8214420Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:33.8214946Z kernel = self.compile( 2025-05-07T20:32:33.8215550Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:33.8216197Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:33.8216595Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:33.8216817Z 2025-05-07T20:32:33.8217036Z self = 2025-05-07T20:32:33.8218085Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:33.8220215Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f779dc29300>} 2025-05-07T20:32:33.8221536Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:33.8222538Z context = 2025-05-07T20:32:33.8222896Z 2025-05-07T20:32:33.8223069Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:33.8223574Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:33.8224041Z module_map=module_map) 2025-05-07T20:32:33.8224400Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:33.8224741Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:32:33.8225006Z E ^ 2025-05-07T20:32:33.8225458Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:33.8225894Z 2025-05-07T20:32:33.8226307Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:33.9634765Z 2025-05-07T20:32:33.9635389Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:33.9636600Z self=, 2025-05-07T20:32:33.9637683Z T=1, 2025-05-07T20:32:33.9638562Z D=5120, 2025-05-07T20:32:33.9638987Z scale_ub=1200.0, 2025-05-07T20:32:33.9639441Z contiguous=False, 2025-05-07T20:32:33.9639901Z compiled=False, 2025-05-07T20:32:33.9640331Z ) 2025-05-07T20:32:33.9640961Z self = 2025-05-07T20:32:33.9641942Z T = 1, D = 5120, scale_ub = 1200.0, contiguous = False, compiled = False 2025-05-07T20:32:33.9642473Z 2025-05-07T20:32:33.9642645Z @given( 2025-05-07T20:32:33.9643112Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:33.9643747Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:33.9644369Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:33.9644794Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:33.9645148Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:33.9645441Z ) 2025-05-07T20:32:33.9645800Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:33.9646248Z def test_silu_mul_quant( 2025-05-07T20:32:33.9646503Z self, 2025-05-07T20:32:33.9646705Z T: int, 2025-05-07T20:32:33.9646905Z D: int, 2025-05-07T20:32:33.9647127Z scale_ub: Optional[float], 2025-05-07T20:32:33.9647402Z contiguous: bool, 2025-05-07T20:32:33.9647644Z compiled: bool, 2025-05-07T20:32:33.9647874Z ) -> None: 2025-05-07T20:32:33.9648095Z torch.manual_seed(2025) 2025-05-07T20:32:33.9648339Z 2025-05-07T20:32:33.9648619Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:33.9649055Z 2025-05-07T20:32:33.9649252Z x_sign = torch.sign(x) 2025-05-07T20:32:33.9649549Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:33.9649865Z x = x_sign * x_clamp 2025-05-07T20:32:33.9650119Z x0 = x[:, :D] 2025-05-07T20:32:33.9650336Z x1 = x[:, D:] 2025-05-07T20:32:33.9650554Z 2025-05-07T20:32:33.9650747Z if contiguous: 2025-05-07T20:32:33.9650980Z x0 = x0.contiguous() 2025-05-07T20:32:33.9651253Z x1 = x1.contiguous() 2025-05-07T20:32:33.9651496Z 2025-05-07T20:32:33.9651775Z if scale_ub is not None: 2025-05-07T20:32:33.9652056Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:33.9652396Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:33.9652707Z ) 2025-05-07T20:32:33.9652908Z else: 2025-05-07T20:32:33.9653126Z scale_ub_tensor = None 2025-05-07T20:32:33.9653378Z 2025-05-07T20:32:33.9653617Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:33.9654096Z op = silu_mul_quant 2025-05-07T20:32:33.9654344Z if compiled: 2025-05-07T20:32:33.9654599Z op = torch.compile(op) 2025-05-07T20:32:33.9654901Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:33.9655181Z 2025-05-07T20:32:33.9655460Z > y_fp8, y_scale = fn() 2025-05-07T20:32:33.9655637Z 2025-05-07T20:32:33.9655740Z moe/activation_test.py:117: 2025-05-07T20:32:33.9656044Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:33.9656381Z moe/activation_test.py:115: in fn 2025-05-07T20:32:33.9656671Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:33.9657371Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:32:33.9658053Z 
_fbgemm_silu_mul_quant[grid]( 2025-05-07T20:32:33.9658594Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:33.9659277Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:33.9659940Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:33.9660478Z kernel = self.compile( 2025-05-07T20:32:33.9661067Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:33.9661728Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:33.9662140Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:33.9662368Z 2025-05-07T20:32:33.9662576Z self = 2025-05-07T20:32:33.9663640Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:33.9665002Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f779dc2a020>} 2025-05-07T20:32:33.9666335Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:33.9667345Z context = 2025-05-07T20:32:33.9667635Z 2025-05-07T20:32:33.9667803Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:33.9668326Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:33.9668795Z module_map=module_map) 2025-05-07T20:32:33.9669158Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:33.9669564Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:32:33.9669830Z E ^ 2025-05-07T20:32:33.9670288Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:33.9670740Z 2025-05-07T20:32:33.9671155Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:33.9671669Z 2025-05-07T20:32:33.9671774Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:33.9672234Z self=, 2025-05-07T20:32:33.9672636Z T=16384, 2025-05-07T20:32:33.9672839Z D=5120, 2025-05-07T20:32:33.9673043Z scale_ub=1200.0, 2025-05-07T20:32:33.9673269Z contiguous=False, 2025-05-07T20:32:33.9673500Z compiled=True, 2025-05-07T20:32:33.9673711Z ) 2025-05-07T20:32:33.9674030Z self = 2025-05-07T20:32:33.9674535Z T = 16384, D = 5120, scale_ub = 1200.0, contiguous = False, compiled = True 2025-05-07T20:32:33.9674816Z 2025-05-07T20:32:33.9674900Z @given( 2025-05-07T20:32:33.9675186Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:33.9675543Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:33.9675861Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:33.9676200Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:33.9676527Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:33.9676821Z ) 2025-05-07T20:32:33.9677174Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:33.9677616Z def test_silu_mul_quant( 2025-05-07T20:32:33.9677866Z self, 2025-05-07T20:32:33.9678072Z T: int, 2025-05-07T20:32:33.9678279Z D: int, 2025-05-07T20:32:33.9678500Z scale_ub: Optional[float], 2025-05-07T20:32:33.9678784Z contiguous: bool, 2025-05-07T20:32:33.9679032Z compiled: bool, 2025-05-07T20:32:33.9679257Z ) -> None: 2025-05-07T20:32:33.9679479Z torch.manual_seed(2025) 2025-05-07T20:32:33.9679727Z 2025-05-07T20:32:33.9679999Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:33.9680350Z 2025-05-07T20:32:33.9680601Z x_sign = torch.sign(x) 2025-05-07T20:32:33.9680898Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:33.9681215Z x = x_sign * x_clamp 2025-05-07T20:32:33.9681465Z x0 = x[:, :D] 2025-05-07T20:32:33.9681684Z x1 = x[:, D:] 2025-05-07T20:32:33.9681901Z 2025-05-07T20:32:33.9682096Z if contiguous: 2025-05-07T20:32:33.9682330Z x0 = x0.contiguous() 2025-05-07T20:32:33.9682593Z x1 = x1.contiguous() 2025-05-07T20:32:33.9682840Z 2025-05-07T20:32:33.9683034Z if scale_ub is not None: 2025-05-07T20:32:33.9683315Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:33.9683656Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:33.9683971Z ) 2025-05-07T20:32:33.9684165Z else: 2025-05-07T20:32:33.9684381Z scale_ub_tensor = None 2025-05-07T20:32:33.9684637Z 2025-05-07T20:32:33.9684869Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:33.9685193Z op = silu_mul_quant 2025-05-07T20:32:33.9685453Z if compiled: 2025-05-07T20:32:33.9685703Z op = torch.compile(op) 2025-05-07T20:32:33.9686008Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:33.9686291Z 2025-05-07T20:32:33.9686487Z > y_fp8, y_scale = fn() 2025-05-07T20:32:33.9686658Z 2025-05-07T20:32:33.9686759Z moe/activation_test.py:117: 2025-05-07T20:32:33.9687059Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:33.9687398Z moe/activation_test.py:115: in fn 2025-05-07T20:32:33.9687680Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:33.9688293Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/torch/_dynamo/eval_frame.py:678: in _fn 2025-05-07T20:32:33.9688856Z return fn(*args, **kwargs) 
2025-05-07T20:32:33.9689511Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:32:33.9690195Z _fbgemm_silu_mul_quant[grid]( 2025-05-07T20:32:33.9690734Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:33.9691459Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:33.9692135Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:33.9692672Z kernel = self.compile( 2025-05-07T20:32:33.9693218Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:33.9693963Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:33.9694369Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:33.9694596Z 2025-05-07T20:32:33.9694853Z self = 2025-05-07T20:32:33.9695931Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:33.9705265Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f779dc2b600>} 2025-05-07T20:32:33.9708133Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:33.9709145Z context = 2025-05-07T20:32:33.9709441Z 2025-05-07T20:32:33.9709610Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:33.9710252Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:33.9710725Z module_map=module_map) 2025-05-07T20:32:33.9711088Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:33.9711450Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:32:33.9711717Z E ^ 2025-05-07T20:32:33.9712172Z E ValueError("type fp8e4nv not supported in this architecture. 
[Hypothesis went on to retry test_silu_mul_quant with the examples below; every one raised the identical CompilationError from the same _fbgemm_silu_mul_quant compile, so the repeated test bodies and tracebacks are omitted. The final example is reproduced in full.]
2025-05-07T20:32:33.9713661Z Trying example: test_silu_mul_quant(T=2048, D=7168, scale_ub=1200.0, contiguous=False, compiled=True)
2025-05-07T20:32:34.1603769Z Trying example: test_silu_mul_quant(T=1, D=5120, scale_ub=None, contiguous=False, compiled=False)
2025-05-07T20:32:34.1634740Z Trying example: test_silu_mul_quant(T=4096, D=7168, scale_ub=1200.0, contiguous=False, compiled=False)
2025-05-07T20:32:34.3215747Z Trying example: test_silu_mul_quant(T=16384, D=7168, scale_ub=None, contiguous=True, compiled=True)
2025-05-07T20:32:34.3249442Z Trying example: test_silu_mul_quant(T=4096, D=5120, scale_ub=None, contiguous=False, compiled=True)
2025-05-07T20:32:34.4658250Z Trying example: test_silu_mul_quant(T=4096, D=5120, scale_ub=1200.0, contiguous=False, compiled=False)
2025-05-07T20:32:34.4690392Z Trying example: test_silu_mul_quant(T=4096, D=5120, scale_ub=1200.0, contiguous=False, compiled=True)
2025-05-07T20:32:34.4722648Z Trying example: test_silu_mul_quant(T=2048, D=7168, scale_ub=1200.0, contiguous=False, compiled=False)
2025-05-07T20:32:34.6706218Z Trying example: test_silu_mul_quant(T=1, D=7168, scale_ub=None, contiguous=True, compiled=False)
2025-05-07T20:32:34.6745801Z Trying example: test_silu_mul_quant(T=16384, D=7168, scale_ub=1200.0, contiguous=False, compiled=True)
2025-05-07T20:32:34.8097686Z Trying example: test_silu_mul_quant(T=1, D=7168, scale_ub=None, contiguous=False, compiled=False)
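Rather than letting Hypothesis replay the identical compile failure for every drawn example, the whole test could be skipped up front on unsupported hardware. A sketch under the same assumptions as the check above (the SM 8.9 cutoff, plus a hypothetical test-class name; the `self=` reprs suggest the test lives in a `unittest.TestCase`):

```python
import unittest

import torch

def _fp8e4nv_supported() -> bool:
    # Assumed cutoff: FP8 E4M3 (fp8e4nv) lowers only on SM 8.9+.
    return torch.cuda.is_available() and torch.cuda.get_device_capability() >= (8, 9)

# Hypothetical class name; the decorator skips every example up front
# instead of failing once per Hypothesis draw.
@unittest.skipIf(not _fp8e4nv_supported(), "fp8e4nv needs compute capability >= 8.9")
class SiluMulQuantTests(unittest.TestCase):
    ...
```

The skip condition is evaluated once, when the module is imported, so the class-level guard covers every parameter combination Hypothesis would otherwise try.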
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:34.8128040Z 2025-05-07T20:32:34.8128456Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:34.8128959Z 2025-05-07T20:32:34.8129069Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:34.8129473Z self=, 2025-05-07T20:32:34.8129875Z T=2048, 2025-05-07T20:32:34.8130069Z D=7168, 2025-05-07T20:32:34.8130265Z scale_ub=None, 2025-05-07T20:32:34.8130486Z contiguous=False, 2025-05-07T20:32:34.8130713Z compiled=True, 2025-05-07T20:32:34.8130912Z ) 2025-05-07T20:32:34.8131228Z self = 2025-05-07T20:32:34.8131723Z T = 2048, D = 7168, scale_ub = None, contiguous = False, compiled = True 2025-05-07T20:32:34.8131990Z 2025-05-07T20:32:34.8132130Z @given( 2025-05-07T20:32:34.8132366Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:34.8132690Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:34.8133001Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:34.8133332Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:34.8133737Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:34.8134021Z ) 2025-05-07T20:32:34.8134369Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:34.8134812Z def test_silu_mul_quant( 2025-05-07T20:32:34.8135101Z self, 2025-05-07T20:32:34.8135308Z T: int, 2025-05-07T20:32:34.8135505Z D: int, 2025-05-07T20:32:34.8135725Z scale_ub: Optional[float], 2025-05-07T20:32:34.8135998Z contiguous: bool, 2025-05-07T20:32:34.8136246Z compiled: bool, 2025-05-07T20:32:34.8136470Z ) -> None: 2025-05-07T20:32:34.8136692Z torch.manual_seed(2025) 2025-05-07T20:32:34.8136938Z 2025-05-07T20:32:34.8137202Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:34.8137552Z 2025-05-07T20:32:34.8137748Z x_sign = torch.sign(x) 2025-05-07T20:32:34.8138029Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:34.8138338Z x = x_sign * x_clamp 2025-05-07T20:32:34.8138577Z x0 = x[:, :D] 2025-05-07T20:32:34.8138798Z x1 = x[:, D:] 2025-05-07T20:32:34.8139002Z 2025-05-07T20:32:34.8139192Z if contiguous: 2025-05-07T20:32:34.8139423Z x0 = x0.contiguous() 2025-05-07T20:32:34.8139729Z x1 = x1.contiguous() 2025-05-07T20:32:34.8139971Z 2025-05-07T20:32:34.8140165Z if scale_ub is not None: 2025-05-07T20:32:34.8140434Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:34.8140769Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:34.8141080Z ) 2025-05-07T20:32:34.8141273Z else: 2025-05-07T20:32:34.8141487Z scale_ub_tensor = None 2025-05-07T20:32:34.8141743Z 2025-05-07T20:32:34.8141970Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:34.8142335Z op = silu_mul_quant 2025-05-07T20:32:34.8142584Z if compiled: 2025-05-07T20:32:34.8142823Z op = torch.compile(op) 2025-05-07T20:32:34.8143118Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:34.8143393Z 2025-05-07T20:32:34.8143582Z > y_fp8, y_scale = fn() 2025-05-07T20:32:34.8143755Z 2025-05-07T20:32:34.8143853Z moe/activation_test.py:117: 2025-05-07T20:32:34.8144151Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:34.8144480Z moe/activation_test.py:115: in fn 2025-05-07T20:32:34.8144782Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:34.8145402Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/torch/_dynamo/eval_frame.py:678: in _fn 2025-05-07T20:32:34.8145959Z return fn(*args, **kwargs) 
2025-05-07T20:32:34.8146605Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:32:34.8147293Z _fbgemm_silu_mul_quant[grid]( 2025-05-07T20:32:34.8147834Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:34.8148509Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:34.8149156Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:34.8149695Z kernel = self.compile( 2025-05-07T20:32:34.8150228Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:34.8150882Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:34.8151323Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:34.8151558Z 2025-05-07T20:32:34.8151764Z self = 2025-05-07T20:32:34.8152831Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:34.8154174Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f78be12c2c0>} 2025-05-07T20:32:34.8155489Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:34.8156502Z context = 2025-05-07T20:32:34.8156792Z 2025-05-07T20:32:34.8156958Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:34.8157480Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:34.8157939Z module_map=module_map) 2025-05-07T20:32:34.8158303Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:34.8158657Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:32:34.8158914Z E ^ 2025-05-07T20:32:34.8159370Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:34.8159863Z 2025-05-07T20:32:34.8160272Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:34.8160775Z 2025-05-07T20:32:34.8160888Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:34.8161292Z self=, 2025-05-07T20:32:34.8161689Z T=4096, 2025-05-07T20:32:34.8161876Z D=7168, 2025-05-07T20:32:34.8162065Z scale_ub=None, 2025-05-07T20:32:34.8162325Z contiguous=False, 2025-05-07T20:32:34.8162548Z compiled=True, 2025-05-07T20:32:35.2282507Z ) 2025-05-07T20:32:35.2282863Z self = 2025-05-07T20:32:35.2283370Z T = 4096, D = 7168, scale_ub = None, contiguous = False, compiled = True 2025-05-07T20:32:35.2283675Z 2025-05-07T20:32:35.2283771Z @given( 2025-05-07T20:32:35.2284007Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:35.2284327Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:35.2284637Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:35.2284987Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:35.2285457Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:35.2285750Z ) 2025-05-07T20:32:35.2286095Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:35.2286528Z def test_silu_mul_quant( 2025-05-07T20:32:35.2286773Z self, 2025-05-07T20:32:35.2286966Z T: int, 2025-05-07T20:32:35.2287322Z D: int, 2025-05-07T20:32:35.2287544Z scale_ub: Optional[float], 2025-05-07T20:32:35.2287817Z contiguous: bool, 2025-05-07T20:32:35.2288048Z compiled: bool, 2025-05-07T20:32:35.2288271Z ) -> None: 2025-05-07T20:32:35.2288483Z torch.manual_seed(2025) 2025-05-07T20:32:35.2288720Z 2025-05-07T20:32:35.2288992Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:35.2289325Z 2025-05-07T20:32:35.2289519Z x_sign = torch.sign(x) 2025-05-07T20:32:35.2289800Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:35.2290105Z x = x_sign * x_clamp 2025-05-07T20:32:35.2290344Z x0 = x[:, :D] 2025-05-07T20:32:35.2290650Z x1 = x[:, D:] 2025-05-07T20:32:35.2290858Z 2025-05-07T20:32:35.2291041Z if contiguous: 2025-05-07T20:32:35.2291263Z x0 = x0.contiguous() 2025-05-07T20:32:35.2291520Z x1 = x1.contiguous() 2025-05-07T20:32:35.2291759Z 2025-05-07T20:32:35.2291943Z if scale_ub is not None: 2025-05-07T20:32:35.2292216Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:35.2292549Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:35.2292856Z ) 2025-05-07T20:32:35.2293050Z else: 2025-05-07T20:32:35.2293263Z scale_ub_tensor = None 2025-05-07T20:32:35.2293509Z 2025-05-07T20:32:35.2293818Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:35.2294131Z op = silu_mul_quant 2025-05-07T20:32:35.2294377Z if compiled: 2025-05-07T20:32:35.2294616Z op = torch.compile(op) 2025-05-07T20:32:35.2294914Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:35.2295194Z 2025-05-07T20:32:35.2295380Z > y_fp8, y_scale = fn() 2025-05-07T20:32:35.2295545Z 2025-05-07T20:32:35.2295644Z moe/activation_test.py:117: 2025-05-07T20:32:35.2295938Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:35.2296263Z moe/activation_test.py:115: in fn 2025-05-07T20:32:35.2296551Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:35.2297108Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/torch/_dynamo/eval_frame.py:678: in _fn 2025-05-07T20:32:35.2297670Z return fn(*args, **kwargs) 
2025-05-07T20:32:35.2298672Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:32:35.2299359Z _fbgemm_silu_mul_quant[grid]( 2025-05-07T20:32:35.2299896Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:35.2300563Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:35.2301221Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:35.2301834Z kernel = self.compile( 2025-05-07T20:32:35.2302373Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:35.2303013Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:35.2303410Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:35.2303637Z 2025-05-07T20:32:35.2303849Z self = 2025-05-07T20:32:35.2304973Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:35.2306313Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f78be12cd60>} 2025-05-07T20:32:35.2307632Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:35.2308705Z context = 2025-05-07T20:32:35.2309024Z 2025-05-07T20:32:35.2309198Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:35.2309706Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:35.2310166Z module_map=module_map) 2025-05-07T20:32:35.2310529Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:35.2310950Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:32:35.2311204Z E ^ 2025-05-07T20:32:35.2311658Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:35.2312099Z 2025-05-07T20:32:35.2312512Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:35.2313012Z 2025-05-07T20:32:35.2313117Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:35.2313517Z self=, 2025-05-07T20:32:35.2313913Z T=16384, 2025-05-07T20:32:35.2314108Z D=5120, 2025-05-07T20:32:35.2314336Z scale_ub=1200.0, 2025-05-07T20:32:35.2314550Z contiguous=False, 2025-05-07T20:32:35.2314774Z compiled=False, 2025-05-07T20:32:35.2314978Z ) 2025-05-07T20:32:35.2315295Z self = 2025-05-07T20:32:35.2315784Z T = 16384, D = 5120, scale_ub = 1200.0, contiguous = False, compiled = False 2025-05-07T20:32:35.2316064Z 2025-05-07T20:32:35.2316141Z @given( 2025-05-07T20:32:35.2316373Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:35.2316680Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:35.2316982Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:35.2317309Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:35.2317625Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:35.2317907Z ) 2025-05-07T20:32:35.2318258Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:35.2318763Z def test_silu_mul_quant( 2025-05-07T20:32:35.2318992Z self, 2025-05-07T20:32:35.2319184Z T: int, 2025-05-07T20:32:35.2319379Z D: int, 2025-05-07T20:32:35.2319589Z scale_ub: Optional[float], 2025-05-07T20:32:35.2319865Z contiguous: bool, 2025-05-07T20:32:35.2320106Z compiled: bool, 2025-05-07T20:32:35.2320323Z ) -> None: 2025-05-07T20:32:35.2320537Z torch.manual_seed(2025) 2025-05-07T20:32:35.2320781Z 2025-05-07T20:32:35.2321090Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:35.2321430Z 2025-05-07T20:32:35.2321622Z x_sign = torch.sign(x) 2025-05-07T20:32:35.2329399Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:35.2329754Z x = x_sign * x_clamp 2025-05-07T20:32:35.2329999Z x0 = x[:, :D] 2025-05-07T20:32:35.2330226Z x1 = x[:, D:] 2025-05-07T20:32:35.2330429Z 2025-05-07T20:32:35.2330623Z if contiguous: 2025-05-07T20:32:35.2330857Z x0 = x0.contiguous() 2025-05-07T20:32:35.2331111Z x1 = x1.contiguous() 2025-05-07T20:32:35.2331353Z 2025-05-07T20:32:35.2331551Z if scale_ub is not None: 2025-05-07T20:32:35.2331820Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:35.2332235Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:35.2332548Z ) 2025-05-07T20:32:35.2332744Z else: 2025-05-07T20:32:35.2332961Z scale_ub_tensor = None 2025-05-07T20:32:35.2333218Z 2025-05-07T20:32:35.2333458Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:35.2333845Z op = silu_mul_quant 2025-05-07T20:32:35.2334100Z if compiled: 2025-05-07T20:32:35.2334353Z op = torch.compile(op) 2025-05-07T20:32:35.2334645Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:35.2334943Z 2025-05-07T20:32:35.2335174Z > y_fp8, y_scale = fn() 2025-05-07T20:32:35.2335345Z 2025-05-07T20:32:35.2335445Z moe/activation_test.py:117: 2025-05-07T20:32:35.2335746Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:35.2336086Z moe/activation_test.py:115: in fn 2025-05-07T20:32:35.2336365Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:35.2337108Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 
2025-05-07T20:32:35.2337801Z _fbgemm_silu_mul_quant[grid]( 2025-05-07T20:32:35.2338336Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:35.2339003Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:35.2339664Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:35.2340205Z kernel = self.compile( 2025-05-07T20:32:35.2340749Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:35.2341394Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:35.2341799Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:35.2342025Z 2025-05-07T20:32:35.2342243Z self = 2025-05-07T20:32:35.2343305Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:35.2344661Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f78be12dc60>} 2025-05-07T20:32:35.2345987Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:35.2347044Z context = 2025-05-07T20:32:35.2347322Z 2025-05-07T20:32:35.2347490Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:35.2348010Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:35.2348521Z module_map=module_map) 2025-05-07T20:32:35.2348887Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:35.2349237Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:32:35.2349502Z E ^ 2025-05-07T20:32:35.2349961Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:35.2350401Z 2025-05-07T20:32:35.2350815Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:35.2351323Z 2025-05-07T20:32:35.2351425Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:35.2351833Z self=, 2025-05-07T20:32:35.2352276Z T=16384, 2025-05-07T20:32:35.2352468Z D=5120, 2025-05-07T20:32:35.2352671Z scale_ub=1200.0, 2025-05-07T20:32:35.2352896Z contiguous=True, 2025-05-07T20:32:35.2353113Z compiled=True, 2025-05-07T20:32:35.2353321Z ) 2025-05-07T20:32:35.2353642Z self = 2025-05-07T20:32:35.2354130Z T = 16384, D = 5120, scale_ub = 1200.0, contiguous = True, compiled = True 2025-05-07T20:32:35.2354411Z 2025-05-07T20:32:35.2354488Z @given( 2025-05-07T20:32:35.2354722Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:35.2355037Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:35.2355342Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:35.2355670Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:35.2355997Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:35.2356276Z ) 2025-05-07T20:32:35.2356635Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:35.2357124Z def test_silu_mul_quant( 2025-05-07T20:32:35.2357361Z self, 2025-05-07T20:32:35.2357560Z T: int, 2025-05-07T20:32:35.2357761Z D: int, 2025-05-07T20:32:35.2357977Z scale_ub: Optional[float], 2025-05-07T20:32:35.2358249Z contiguous: bool, 2025-05-07T20:32:35.2358491Z compiled: bool, 2025-05-07T20:32:35.2358716Z ) -> None: 2025-05-07T20:32:35.2358929Z torch.manual_seed(2025) 2025-05-07T20:32:35.2359170Z 2025-05-07T20:32:35.2359441Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:35.2359777Z 2025-05-07T20:32:35.2359978Z x_sign = torch.sign(x) 2025-05-07T20:32:35.2360271Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:35.2360573Z x = x_sign * x_clamp 2025-05-07T20:32:35.2360813Z x0 = x[:, :D] 2025-05-07T20:32:35.2361033Z x1 = x[:, D:] 2025-05-07T20:32:35.2361240Z 2025-05-07T20:32:35.2361432Z if contiguous: 2025-05-07T20:32:35.2361670Z x0 = x0.contiguous() 2025-05-07T20:32:35.2361924Z x1 = x1.contiguous() 2025-05-07T20:32:35.2362169Z 2025-05-07T20:32:35.2362367Z if scale_ub is not None: 2025-05-07T20:32:35.2362638Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:35.2362975Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:35.2363284Z ) 2025-05-07T20:32:35.2363473Z else: 2025-05-07T20:32:35.2363685Z scale_ub_tensor = None 2025-05-07T20:32:35.2363939Z 2025-05-07T20:32:35.2364165Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:35.2364522Z op = silu_mul_quant 2025-05-07T20:32:35.2364769Z if compiled: 2025-05-07T20:32:35.2365022Z op = torch.compile(op) 2025-05-07T20:32:35.2365310Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:35.2365590Z 2025-05-07T20:32:35.2365787Z > y_fp8, y_scale = fn() 2025-05-07T20:32:35.2365949Z 2025-05-07T20:32:35.2366051Z moe/activation_test.py:117: 2025-05-07T20:32:35.2366344Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:35.2366722Z moe/activation_test.py:115: in fn 2025-05-07T20:32:35.2366995Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:35.2367547Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/torch/_dynamo/eval_frame.py:678: in _fn 2025-05-07T20:32:35.2368098Z return fn(*args, **kwargs) 
2025-05-07T20:32:35.2368751Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:32:35.2369425Z _fbgemm_silu_mul_quant[grid]( 2025-05-07T20:32:35.2369956Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:35.2370701Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:35.2371361Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:35.2371882Z kernel = self.compile( 2025-05-07T20:32:35.2372422Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:35.2373069Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:35.2373459Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:35.2373767Z 2025-05-07T20:32:35.2373973Z self = 2025-05-07T20:32:35.2375086Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:35.2376477Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f78be12f380>} 2025-05-07T20:32:35.2377794Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:35.2378796Z context = 2025-05-07T20:32:35.2379085Z 2025-05-07T20:32:35.2379248Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:35.2379759Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:35.2380221Z module_map=module_map) 2025-05-07T20:32:35.2380577Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:35.2380929Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:32:35.2381192Z E ^ 2025-05-07T20:32:35.2381641Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:35.2382086Z 2025-05-07T20:32:35.2382493Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:35.3925474Z 2025-05-07T20:32:35.3925637Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:35.3926237Z self=, 2025-05-07T20:32:35.3926690Z T=16384, 2025-05-07T20:32:35.3926900Z D=5120, 2025-05-07T20:32:35.3927096Z scale_ub=None, 2025-05-07T20:32:35.3927319Z contiguous=False, 2025-05-07T20:32:35.3927681Z compiled=True, 2025-05-07T20:32:35.3927885Z ) 2025-05-07T20:32:35.3928213Z self = 2025-05-07T20:32:35.3928710Z T = 16384, D = 5120, scale_ub = None, contiguous = False, compiled = True 2025-05-07T20:32:35.3928985Z 2025-05-07T20:32:35.3929067Z @given( 2025-05-07T20:32:35.3929308Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:35.3929626Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:35.3930002Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:35.3930325Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:35.3930656Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:35.3930951Z ) 2025-05-07T20:32:35.3931293Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:35.3931735Z def test_silu_mul_quant( 2025-05-07T20:32:35.3931981Z self, 2025-05-07T20:32:35.3932180Z T: int, 2025-05-07T20:32:35.3932384Z D: int, 2025-05-07T20:32:35.3932604Z scale_ub: Optional[float], 2025-05-07T20:32:35.3932874Z contiguous: bool, 2025-05-07T20:32:35.3933124Z compiled: bool, 2025-05-07T20:32:35.3933355Z ) -> None: 2025-05-07T20:32:35.3933637Z torch.manual_seed(2025) 2025-05-07T20:32:35.3933960Z 2025-05-07T20:32:35.3934236Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:35.3934584Z 2025-05-07T20:32:35.3934779Z x_sign = torch.sign(x) 2025-05-07T20:32:35.3935102Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:35.3935441Z x = x_sign * x_clamp 2025-05-07T20:32:35.3935678Z x0 = x[:, :D] 2025-05-07T20:32:35.3935901Z x1 = x[:, D:] 2025-05-07T20:32:35.3936112Z 2025-05-07T20:32:35.3936296Z if contiguous: 2025-05-07T20:32:35.3936532Z x0 = x0.contiguous() 2025-05-07T20:32:35.3936794Z x1 = x1.contiguous() 2025-05-07T20:32:35.3937033Z 2025-05-07T20:32:35.3937230Z if scale_ub is not None: 2025-05-07T20:32:35.3937506Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:35.3937836Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:35.3938154Z ) 2025-05-07T20:32:35.3938352Z else: 2025-05-07T20:32:35.3938633Z scale_ub_tensor = None 2025-05-07T20:32:35.3938892Z 2025-05-07T20:32:35.3939127Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:35.3939445Z op = silu_mul_quant 2025-05-07T20:32:35.3939714Z if compiled: 2025-05-07T20:32:35.3939961Z op = torch.compile(op) 2025-05-07T20:32:35.3940261Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:35.3940542Z 2025-05-07T20:32:35.3940742Z > y_fp8, y_scale = fn() 2025-05-07T20:32:35.3940912Z 2025-05-07T20:32:35.3941013Z moe/activation_test.py:117: 2025-05-07T20:32:35.3941315Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:35.3941654Z moe/activation_test.py:115: in fn 2025-05-07T20:32:35.3941932Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:35.3942491Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/torch/_dynamo/eval_frame.py:678: in _fn 2025-05-07T20:32:35.3943055Z return fn(*args, **kwargs) 
2025-05-07T20:32:35.3943712Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:32:35.3944396Z _fbgemm_silu_mul_quant[grid]( 2025-05-07T20:32:35.3944933Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:35.3945605Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:35.3946260Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:35.3946845Z kernel = self.compile( 2025-05-07T20:32:35.3947381Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:35.3948036Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:35.3948436Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:35.3948668Z 2025-05-07T20:32:35.3948875Z self = 2025-05-07T20:32:35.3949982Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:35.3951333Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f779d8a85e0>} 2025-05-07T20:32:35.3952653Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:35.3953705Z context = 2025-05-07T20:32:35.3953996Z 2025-05-07T20:32:35.3954164Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:35.3954684Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:35.3955151Z module_map=module_map) 2025-05-07T20:32:35.3955517Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:35.3955873Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:32:35.3956132Z E ^ 2025-05-07T20:32:35.3956591Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:35.3957040Z 2025-05-07T20:32:35.3957451Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:35.3957954Z 2025-05-07T20:32:35.3958063Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:35.3958474Z self=, 2025-05-07T20:32:35.3958917Z T=2048, 2025-05-07T20:32:35.3959113Z D=5120, 2025-05-07T20:32:35.3959305Z scale_ub=None, 2025-05-07T20:32:35.3959521Z contiguous=False, 2025-05-07T20:32:35.3959750Z compiled=True, 2025-05-07T20:32:35.3959949Z ) 2025-05-07T20:32:35.3960271Z self = 2025-05-07T20:32:35.3960763Z T = 2048, D = 5120, scale_ub = None, contiguous = False, compiled = True 2025-05-07T20:32:35.3961033Z 2025-05-07T20:32:35.3961116Z @given( 2025-05-07T20:32:35.3961346Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:35.3961666Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:35.3961973Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:35.3962298Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:35.3962627Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:35.3962919Z ) 2025-05-07T20:32:35.3963263Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:35.3963708Z def test_silu_mul_quant( 2025-05-07T20:32:35.3963950Z self, 2025-05-07T20:32:35.3964148Z T: int, 2025-05-07T20:32:35.3964341Z D: int, 2025-05-07T20:32:35.3964561Z scale_ub: Optional[float], 2025-05-07T20:32:35.3964837Z contiguous: bool, 2025-05-07T20:32:35.3965102Z compiled: bool, 2025-05-07T20:32:35.3965350Z ) -> None: 2025-05-07T20:32:35.3965567Z torch.manual_seed(2025) 2025-05-07T20:32:35.3965805Z 2025-05-07T20:32:35.3966072Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:35.3966461Z 2025-05-07T20:32:35.3966653Z x_sign = torch.sign(x) 2025-05-07T20:32:35.3966943Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:35.3967252Z x = x_sign * x_clamp 2025-05-07T20:32:35.3967484Z x0 = x[:, :D] 2025-05-07T20:32:35.3967702Z x1 = x[:, D:] 2025-05-07T20:32:35.3967908Z 2025-05-07T20:32:35.3968094Z if contiguous: 2025-05-07T20:32:35.3968327Z x0 = x0.contiguous() 2025-05-07T20:32:35.3968583Z x1 = x1.contiguous() 2025-05-07T20:32:35.3968869Z 2025-05-07T20:32:35.3969062Z if scale_ub is not None: 2025-05-07T20:32:35.3969336Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:35.3969670Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:35.3969974Z ) 2025-05-07T20:32:35.3970166Z else: 2025-05-07T20:32:35.3970373Z scale_ub_tensor = None 2025-05-07T20:32:35.3970621Z 2025-05-07T20:32:35.3970855Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:35.3971174Z op = silu_mul_quant 2025-05-07T20:32:35.3971419Z if compiled: 2025-05-07T20:32:35.3971665Z op = torch.compile(op) 2025-05-07T20:32:35.3972006Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:35.3972280Z 2025-05-07T20:32:35.3972476Z > y_fp8, y_scale = fn() 2025-05-07T20:32:35.3972638Z 2025-05-07T20:32:35.3972739Z moe/activation_test.py:117: 2025-05-07T20:32:35.3973029Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:35.3973414Z moe/activation_test.py:115: in fn 2025-05-07T20:32:35.3973867Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:35.3974430Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/torch/_dynamo/eval_frame.py:678: in _fn 2025-05-07T20:32:35.3974978Z return fn(*args, **kwargs) 
2025-05-07T20:32:35.3975621Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:32:35.3976304Z _fbgemm_silu_mul_quant[grid]( 2025-05-07T20:32:35.3976830Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:35.3977566Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:35.3978224Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:35.3978753Z kernel = self.compile( 2025-05-07T20:32:35.3979275Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:35.3979923Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:35.3980318Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:35.3980539Z 2025-05-07T20:32:35.3980748Z self = 2025-05-07T20:32:35.3981801Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:35.3983149Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f779d8a9440>} 2025-05-07T20:32:35.3984460Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:35.3985457Z context = 2025-05-07T20:32:35.3985739Z 2025-05-07T20:32:35.3985909Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:35.3986462Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:35.3986918Z module_map=module_map) 2025-05-07T20:32:35.3987279Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:35.3987622Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:32:35.3987876Z E ^ 2025-05-07T20:32:35.3988329Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:35.3988838Z 2025-05-07T20:32:35.3989249Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:35.5584829Z 2025-05-07T20:32:35.5585248Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:35.5585672Z self=, 2025-05-07T20:32:35.5586126Z T=2048, 2025-05-07T20:32:35.5586318Z D=5120, 2025-05-07T20:32:35.5586523Z scale_ub=1200.0, 2025-05-07T20:32:35.5586748Z contiguous=False, 2025-05-07T20:32:35.5586975Z compiled=True, 2025-05-07T20:32:35.5587182Z ) 2025-05-07T20:32:35.5587501Z self = 2025-05-07T20:32:35.5588110Z T = 2048, D = 5120, scale_ub = 1200.0, contiguous = False, compiled = True 2025-05-07T20:32:35.5588390Z 2025-05-07T20:32:35.5588471Z @given( 2025-05-07T20:32:35.5588702Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:35.5589015Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:35.5589322Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:35.5589653Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:35.5589975Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:35.5590264Z ) 2025-05-07T20:32:35.5590611Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:35.5591051Z def test_silu_mul_quant( 2025-05-07T20:32:35.5591296Z self, 2025-05-07T20:32:35.5591492Z T: int, 2025-05-07T20:32:35.5591685Z D: int, 2025-05-07T20:32:35.5591901Z scale_ub: Optional[float], 2025-05-07T20:32:35.5592170Z contiguous: bool, 2025-05-07T20:32:35.5592414Z compiled: bool, 2025-05-07T20:32:35.5592634Z ) -> None: 2025-05-07T20:32:35.5592924Z torch.manual_seed(2025) 2025-05-07T20:32:35.5593165Z 2025-05-07T20:32:35.5593431Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:35.5593773Z 2025-05-07T20:32:35.5593966Z x_sign = torch.sign(x) 2025-05-07T20:32:35.5594254Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:35.5594569Z x = x_sign * x_clamp 2025-05-07T20:32:35.5594809Z x0 = x[:, :D] 2025-05-07T20:32:35.5595020Z x1 = x[:, D:] 2025-05-07T20:32:35.5595229Z 2025-05-07T20:32:35.5595417Z if contiguous: 2025-05-07T20:32:35.5595643Z x0 = x0.contiguous() 2025-05-07T20:32:35.5595906Z x1 = x1.contiguous() 2025-05-07T20:32:35.5596148Z 2025-05-07T20:32:35.5596337Z if scale_ub is not None: 2025-05-07T20:32:35.5596609Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:35.5596945Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:35.5597258Z ) 2025-05-07T20:32:35.5597451Z else: 2025-05-07T20:32:35.5597666Z scale_ub_tensor = None 2025-05-07T20:32:35.5597920Z 2025-05-07T20:32:35.5598147Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:35.5598621Z op = silu_mul_quant 2025-05-07T20:32:35.5598874Z if compiled: 2025-05-07T20:32:35.5599115Z op = torch.compile(op) 2025-05-07T20:32:35.5599411Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:35.5599682Z 2025-05-07T20:32:35.5599871Z > y_fp8, y_scale = fn() 2025-05-07T20:32:35.5600040Z 2025-05-07T20:32:35.5600213Z moe/activation_test.py:117: 2025-05-07T20:32:35.5600508Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:35.5600832Z moe/activation_test.py:115: in fn 2025-05-07T20:32:35.5601111Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:35.5601674Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/torch/_dynamo/eval_frame.py:678: in _fn 2025-05-07T20:32:35.5602226Z return fn(*args, **kwargs) 
2025-05-07T20:32:35.5602871Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:32:35.5603619Z _fbgemm_silu_mul_quant[grid]( 2025-05-07T20:32:35.5604151Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:35.5604814Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:35.5605470Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:35.5605996Z kernel = self.compile( 2025-05-07T20:32:35.5606522Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:35.5607237Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:35.5613386Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:35.5613643Z 2025-05-07T20:32:35.5613924Z self = 2025-05-07T20:32:35.5614993Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:35.5616341Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f779d8aa660>} 2025-05-07T20:32:35.5617668Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:35.5618764Z context = 2025-05-07T20:32:35.5619059Z 2025-05-07T20:32:35.5619226Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:35.5619743Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:35.5620211Z module_map=module_map) 2025-05-07T20:32:35.5620568Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:35.5620922Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:32:35.5621180Z E ^ 2025-05-07T20:32:35.5621630Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:35.5622084Z 2025-05-07T20:32:35.5622491Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:35.5623002Z 2025-05-07T20:32:35.5623110Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:35.5623514Z self=, 2025-05-07T20:32:35.5623905Z T=4096, 2025-05-07T20:32:35.5624093Z D=5120, 2025-05-07T20:32:35.5624284Z scale_ub=1200.0, 2025-05-07T20:32:35.5624500Z contiguous=True, 2025-05-07T20:32:35.5624718Z compiled=True, 2025-05-07T20:32:35.5624920Z ) 2025-05-07T20:32:35.5625226Z self = 2025-05-07T20:32:35.5625712Z T = 4096, D = 5120, scale_ub = 1200.0, contiguous = True, compiled = True 2025-05-07T20:32:35.5625985Z 2025-05-07T20:32:35.5626059Z @given( 2025-05-07T20:32:35.5626285Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:35.5626635Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:35.5626933Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:35.5627259Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:35.5627576Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:35.5627858Z ) 2025-05-07T20:32:35.5628201Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:35.5628627Z def test_silu_mul_quant( 2025-05-07T20:32:35.5628906Z self, 2025-05-07T20:32:35.5629095Z T: int, 2025-05-07T20:32:35.5629278Z D: int, 2025-05-07T20:32:35.5629488Z scale_ub: Optional[float], 2025-05-07T20:32:35.5629753Z contiguous: bool, 2025-05-07T20:32:35.5629984Z compiled: bool, 2025-05-07T20:32:35.5630195Z ) -> None: 2025-05-07T20:32:35.5630399Z torch.manual_seed(2025) 2025-05-07T20:32:35.5630630Z 2025-05-07T20:32:35.5630894Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:35.5631222Z 2025-05-07T20:32:35.5631404Z x_sign = torch.sign(x) 2025-05-07T20:32:35.5631689Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:35.5632031Z x = x_sign * x_clamp 2025-05-07T20:32:35.5632266Z x0 = x[:, :D] 2025-05-07T20:32:35.5632476Z x1 = x[:, D:] 2025-05-07T20:32:35.5632675Z 2025-05-07T20:32:35.5632852Z if contiguous: 2025-05-07T20:32:35.5633075Z x0 = x0.contiguous() 2025-05-07T20:32:35.5633320Z x1 = x1.contiguous() 2025-05-07T20:32:35.5633552Z 2025-05-07T20:32:35.5633730Z if scale_ub is not None: 2025-05-07T20:32:35.5633988Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:35.5634309Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:35.5634607Z ) 2025-05-07T20:32:35.5634790Z else: 2025-05-07T20:32:35.5634987Z scale_ub_tensor = None 2025-05-07T20:32:35.5635235Z 2025-05-07T20:32:35.5635453Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:35.5635754Z op = silu_mul_quant 2025-05-07T20:32:35.5635992Z if compiled: 2025-05-07T20:32:35.5636230Z op = torch.compile(op) 2025-05-07T20:32:35.5636564Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:35.5636832Z 2025-05-07T20:32:35.5637017Z > y_fp8, y_scale = fn() 2025-05-07T20:32:35.5637175Z 2025-05-07T20:32:35.5637271Z moe/activation_test.py:117: 2025-05-07T20:32:35.5637556Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:35.5637875Z moe/activation_test.py:115: in fn 2025-05-07T20:32:35.5638145Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:35.5638682Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/torch/_dynamo/eval_frame.py:678: in _fn 2025-05-07T20:32:35.5639227Z return fn(*args, **kwargs) 
2025-05-07T20:32:35.5639869Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:32:35.5640531Z _fbgemm_silu_mul_quant[grid]( 2025-05-07T20:32:35.5641055Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:35.5641720Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:35.5642364Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:35.5642880Z kernel = self.compile( 2025-05-07T20:32:35.5643406Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:35.5644045Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:35.5644430Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:35.5644705Z 2025-05-07T20:32:35.5644905Z self = 2025-05-07T20:32:35.5646015Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:35.5647349Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f779d8ab9c0>} 2025-05-07T20:32:35.5648702Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:35.5649688Z context = 2025-05-07T20:32:35.5649970Z 2025-05-07T20:32:35.5650130Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:35.5650637Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:35.5651091Z module_map=module_map) 2025-05-07T20:32:35.5651482Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:35.5651829Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:32:35.5652080Z E ^ 2025-05-07T20:32:35.5652523Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:35.5652964Z 2025-05-07T20:32:35.5653367Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:35.7332140Z 2025-05-07T20:32:35.7332925Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:35.7333568Z self=, 2025-05-07T20:32:35.7334284Z T=128, 2025-05-07T20:32:35.7334505Z D=5120, 2025-05-07T20:32:35.7334716Z scale_ub=1200.0, 2025-05-07T20:32:35.7334939Z contiguous=False, 2025-05-07T20:32:35.7335203Z compiled=True, 2025-05-07T20:32:35.7335439Z ) 2025-05-07T20:32:35.7335765Z self = 2025-05-07T20:32:35.7336487Z T = 128, D = 5120, scale_ub = 1200.0, contiguous = False, compiled = True 2025-05-07T20:32:35.7336777Z 2025-05-07T20:32:35.7336862Z @given( 2025-05-07T20:32:35.7337100Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:35.7337416Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:35.7337726Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:35.7338055Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:35.7338376Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:35.7338671Z ) 2025-05-07T20:32:35.7339023Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:35.7339463Z def test_silu_mul_quant( 2025-05-07T20:32:35.7339717Z self, 2025-05-07T20:32:35.7339923Z T: int, 2025-05-07T20:32:35.7340117Z D: int, 2025-05-07T20:32:35.7340340Z scale_ub: Optional[float], 2025-05-07T20:32:35.7340618Z contiguous: bool, 2025-05-07T20:32:35.7340867Z compiled: bool, 2025-05-07T20:32:35.7341095Z ) -> None: 2025-05-07T20:32:35.7341323Z torch.manual_seed(2025) 2025-05-07T20:32:35.7341573Z 2025-05-07T20:32:35.7341843Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:35.7342194Z 2025-05-07T20:32:35.7342397Z x_sign = torch.sign(x) 2025-05-07T20:32:35.7342681Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:35.7342998Z x = x_sign * x_clamp 2025-05-07T20:32:35.7343244Z x0 = x[:, :D] 2025-05-07T20:32:35.7343462Z x1 = x[:, D:] 2025-05-07T20:32:35.7343685Z 2025-05-07T20:32:35.7343982Z if contiguous: 2025-05-07T20:32:35.7344224Z x0 = x0.contiguous() 2025-05-07T20:32:35.7344495Z x1 = x1.contiguous() 2025-05-07T20:32:35.7344746Z 2025-05-07T20:32:35.7344941Z if scale_ub is not None: 2025-05-07T20:32:35.7345227Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:35.7345571Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:35.7345887Z ) 2025-05-07T20:32:35.7346085Z else: 2025-05-07T20:32:35.7346308Z scale_ub_tensor = None 2025-05-07T20:32:35.7346695Z 2025-05-07T20:32:35.7346924Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:35.7347251Z op = silu_mul_quant 2025-05-07T20:32:35.7347506Z if compiled: 2025-05-07T20:32:35.7347791Z op = torch.compile(op) 2025-05-07T20:32:35.7348088Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:35.7348370Z 2025-05-07T20:32:35.7348569Z > y_fp8, y_scale = fn() 2025-05-07T20:32:35.7348733Z 2025-05-07T20:32:35.7348832Z moe/activation_test.py:117: 2025-05-07T20:32:35.7349133Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:35.7349468Z moe/activation_test.py:115: in fn 2025-05-07T20:32:35.7349833Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:35.7350405Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/torch/_dynamo/eval_frame.py:678: in _fn 2025-05-07T20:32:35.7350966Z return fn(*args, **kwargs) 
2025-05-07T20:32:35.7351626Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:32:35.7352303Z _fbgemm_silu_mul_quant[grid]( 2025-05-07T20:32:35.7352841Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:35.7353515Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:35.7354181Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:35.7354708Z kernel = self.compile( 2025-05-07T20:32:35.7355297Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:35.7356004Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:35.7356401Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:35.7356639Z 2025-05-07T20:32:35.7356846Z self = 2025-05-07T20:32:35.7357906Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:35.7359278Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f779d108fe0>} 2025-05-07T20:32:35.7360608Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:35.7361610Z context = 2025-05-07T20:32:35.7361907Z 2025-05-07T20:32:35.7362073Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:35.7362595Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:35.7363067Z module_map=module_map) 2025-05-07T20:32:35.7363429Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:35.7363785Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:32:35.7364053Z E ^ 2025-05-07T20:32:35.7364555Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:35.7365005Z 2025-05-07T20:32:35.7365414Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:35.7365930Z 2025-05-07T20:32:35.7366036Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:35.7366454Z self=, 2025-05-07T20:32:35.7366869Z T=16384, 2025-05-07T20:32:35.7367111Z D=7168, 2025-05-07T20:32:35.7367302Z scale_ub=1200.0, 2025-05-07T20:32:35.7367531Z contiguous=True, 2025-05-07T20:32:35.7367770Z compiled=True, 2025-05-07T20:32:35.7367971Z ) 2025-05-07T20:32:35.7368290Z self = 2025-05-07T20:32:35.7368785Z T = 16384, D = 7168, scale_ub = 1200.0, contiguous = True, compiled = True 2025-05-07T20:32:35.7369057Z 2025-05-07T20:32:35.7369137Z @given( 2025-05-07T20:32:35.7369373Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:35.7369688Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:35.7369989Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:35.7370365Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:35.7370700Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:35.7370994Z ) 2025-05-07T20:32:35.7371334Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:35.7371786Z def test_silu_mul_quant( 2025-05-07T20:32:35.7372029Z self, 2025-05-07T20:32:35.7372219Z T: int, 2025-05-07T20:32:35.7372417Z D: int, 2025-05-07T20:32:35.7372636Z scale_ub: Optional[float], 2025-05-07T20:32:35.7372901Z contiguous: bool, 2025-05-07T20:32:35.7373140Z compiled: bool, 2025-05-07T20:32:35.7373366Z ) -> None: 2025-05-07T20:32:35.7373576Z torch.manual_seed(2025) 2025-05-07T20:32:35.7373922Z 2025-05-07T20:32:35.7374195Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:35.7374531Z 2025-05-07T20:32:35.7374728Z x_sign = torch.sign(x) 2025-05-07T20:32:35.7375026Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:35.7375332Z x = x_sign * x_clamp 2025-05-07T20:32:35.7375622Z x0 = x[:, :D] 2025-05-07T20:32:35.7375871Z x1 = x[:, D:] 2025-05-07T20:32:35.7376106Z 2025-05-07T20:32:35.7376288Z if contiguous: 2025-05-07T20:32:35.7376525Z x0 = x0.contiguous() 2025-05-07T20:32:35.7376791Z x1 = x1.contiguous() 2025-05-07T20:32:35.7377029Z 2025-05-07T20:32:35.7377226Z if scale_ub is not None: 2025-05-07T20:32:35.7377506Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:35.7377835Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:35.7378150Z ) 2025-05-07T20:32:35.7378353Z else: 2025-05-07T20:32:35.7378562Z scale_ub_tensor = None 2025-05-07T20:32:35.7378819Z 2025-05-07T20:32:35.7379057Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:35.7379367Z op = silu_mul_quant 2025-05-07T20:32:35.7379629Z if compiled: 2025-05-07T20:32:35.7379884Z op = torch.compile(op) 2025-05-07T20:32:35.7380179Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:35.7380458Z 2025-05-07T20:32:35.7380655Z > y_fp8, y_scale = fn() 2025-05-07T20:32:35.7380823Z 2025-05-07T20:32:35.7380930Z moe/activation_test.py:117: 2025-05-07T20:32:35.7381221Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:35.7381552Z moe/activation_test.py:115: in fn 2025-05-07T20:32:35.7381838Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:35.7382388Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/torch/_dynamo/eval_frame.py:678: in _fn 2025-05-07T20:32:35.7382996Z return fn(*args, **kwargs) 
2025-05-07T20:32:35.7383649Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant
2025-05-07T20:32:35.7384328Z     _fbgemm_silu_mul_quant[grid](
2025-05-07T20:32:35.7384860Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/jit.py:330: in <lambda>
2025-05-07T20:32:35.7385538Z     return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs)
2025-05-07T20:32:35.7386248Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/jit.py:623: in run
2025-05-07T20:32:35.7386771Z     kernel = self.compile(
2025-05-07T20:32:35.7387315Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:273: in compile
2025-05-07T20:32:35.7387969Z     module = src.make_ir(options, codegen_fns, module_map, context)
2025-05-07T20:32:35.7388371Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _
2025-05-07T20:32:35.7389911Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0, ..., debug=False, backend_name='cuda', sanitize_overflow=True)
2025-05-07T20:32:35.7394034Z     def make_ir(self, options, codegen_fns, module_map, context):
2025-05-07T20:32:35.7394548Z >       return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, module_map=module_map)
2025-05-07T20:32:35.7395495Z E   triton.compiler.errors.CompilationError: at 1:0:
2025-05-07T20:32:35.7395850Z E   def _fbgemm_silu_mul_quant(
2025-05-07T20:32:35.7396113Z E   ^
2025-05-07T20:32:35.7396570Z E   ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')")
2025-05-07T20:32:35.7397428Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:100: CompilationError

2025-05-07T20:32:35.8555646Z Trying example: test_silu_mul_quant(T=16384, D=5120, scale_ub=1200.0, contiguous=True, compiled=False)

    @given(
        T=st.sampled_from([1, 128, 2048, 4096, 16384]),
        D=st.sampled_from([5120, 7168]),
        scale_ub=st.sampled_from([None, 1200.00]),
        contiguous=st.sampled_from([True, False]),
        compiled=st.sampled_from([True, False]),
    )
    @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None)
    def test_silu_mul_quant(
        self,
        T: int,
        D: int,
        scale_ub: Optional[float],
        contiguous: bool,
        compiled: bool,
    ) -> None:
        torch.manual_seed(2025)

        x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16)

        x_sign = torch.sign(x)
        x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0)
        x = x_sign * x_clamp
        x0 = x[:, :D]
        x1 = x[:, D:]

        if contiguous:
            x0 = x0.contiguous()
            x1 = x1.contiguous()

        if scale_ub is not None:
            scale_ub_tensor = torch.tensor(
                [scale_ub], device="cuda", dtype=torch.float32
            )
        else:
            scale_ub_tensor = None

        def fn() -> Tuple[torch.Tensor, torch.Tensor]:
            op = silu_mul_quant
            if compiled:
                op = torch.compile(op)
            return op(x0, x1, scale_ub_tensor)

>       y_fp8, y_scale = fn()

moe/activation_test.py:117:
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _
moe/activation_test.py:115: in fn
    return op(x0, x1, scale_ub_tensor)
/home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant
    _fbgemm_silu_mul_quant[grid](
[Triton compile stack identical to the traceback above]
E   triton.compiler.errors.CompilationError: at 1:0:
E   def _fbgemm_silu_mul_quant(
E   ^
E   ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')")

/home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:100: CompilationError
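Note: fp8e4nv is Triton's name for the float8 E4M3 format that the FP8 quantization in _fbgemm_silu_mul_quant produces. Triton only lowers this type on NVIDIA GPUs with compute capability 8.9 or newer (Ada/Hopper); older architectures expose only fp8e4b15 and fp8e5, so every sampled parameter combination fails identically at kernel-compile time. A minimal guard along these lines (hypothetical, not part of the FBGEMM test suite; the (8, 9) cutoff assumes current Triton behavior) would skip the test on such GPUs instead:

    import unittest

    import torch

    def gpu_supports_fp8e4nv() -> bool:
        # fp8e4nv (float8 E4M3, torch.float8_e4m3fn) is only lowered by Triton
        # on devices with compute capability >= 8.9 (Ada / Hopper).
        return torch.cuda.is_available() and torch.cuda.get_device_capability() >= (8, 9)

    # Usage sketch, applied to the failing property test:
    #
    #   @unittest.skipUnless(gpu_supports_fp8e4nv(), "FP8 E4M3 unsupported on this GPU")
    #   def test_silu_mul_quant(self, ...) -> None:
    #       ...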
2025-05-07T20:32:35.8592426Z Trying example: test_silu_mul_quant(T=1, D=7168, scale_ub=1200.0, contiguous=False, compiled=False) -> CompilationError: ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") [test body and traceback identical to the above]
2025-05-07T20:32:35.8632097Z Trying example: test_silu_mul_quant(T=4096, D=7168, scale_ub=1200.0, contiguous=False, compiled=True) -> same CompilationError; with compiled=True the call additionally passes through torch/_dynamo/eval_frame.py:678 in _fn before reaching silu_mul_quant
2025-05-07T20:32:36.0264183Z Trying example: test_silu_mul_quant(T=128, D=7168, scale_ub=1200.0, contiguous=False, compiled=True) -> same CompilationError
2025-05-07T20:32:36.0297280Z Trying example: test_silu_mul_quant(T=2048, D=7168, scale_ub=None, contiguous=True, compiled=True) -> same CompilationError
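Note: the failure is independent of the sampled (T, D, scale_ub, contiguous, compiled) values; any Triton kernel that materializes the fp8e4nv type trips the same architecture check during make_ir. A standalone sketch (hypothetical, not taken from the test suite) that should reproduce the identical ValueError on a pre-8.9 GPU:

    import torch
    import triton
    import triton.language as tl

    @triton.jit
    def cast_to_fp8(x_ptr, y_ptr, N, BLOCK: tl.constexpr):
        offs = tl.program_id(0) * BLOCK + tl.arange(0, BLOCK)
        mask = offs < N
        x = tl.load(x_ptr + offs, mask=mask)
        # On compute capability < 8.9 this cast is what raises
        # ValueError("type fp8e4nv not supported in this architecture. ...")
        y = x.to(tl.float8e4nv)
        tl.store(y_ptr + offs, y, mask=mask)

    x = torch.randn(1024, device="cuda", dtype=torch.float32)
    y = torch.empty(1024, device="cuda", dtype=torch.float8_e4m3fn)
    cast_to_fp8[(1,)](x, y, 1024, BLOCK=1024)  # kernel compiles (and fails) here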
2025-05-07T20:32:36.1587441Z Trying example: test_silu_mul_quant(T=16384, D=5120, scale_ub=None, contiguous=False, compiled=False)

[test body as above, failing earlier, at the first large temporary:]

        x_sign = torch.sign(x)
>       x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0)
E       torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 320.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 140.44 MiB is free. Including non-PyTorch memory, this process has 21.92 GiB memory in use. Of the allocated memory 21.60 GiB is allocated by PyTorch, and 45.02 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables)

moe/activation_test.py:95: OutOfMemoryError
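Note: the OutOfMemoryError examples look like a secondary effect: by this point roughly 21.6 GiB of the 22.07 GiB device is already allocated, each example creates a fresh [T, 2*D] bfloat16 tensor plus several same-sized temporaries (sign, abs, clamp, product), and memory accumulates across Hypothesis examples until even small allocations fail. A sketch (illustrative; free_cuda_memory is not an existing helper) of reclaiming memory between examples, alongside the allocator's own expandable_segments suggestion:

    import gc

    import torch

    def free_cuda_memory() -> None:
        gc.collect()              # drop Python references to dead tensors first
        torch.cuda.empty_cache()  # release cached, unused blocks back to the driver
        torch.cuda.synchronize()  # make sure pending frees have completed

    # e.g. from the test class, so each Hypothesis example starts from a clean slate:
    #
    #   def tearDown(self) -> None:
    #       free_cuda_memory()
    #
    # and/or, before launching the test process:
    #
    #   export PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True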
2025-05-07T20:32:36.1601188Z Trying example: test_silu_mul_quant(T=4096, D=7168, scale_ub=1200.0, contiguous=True, compiled=True) -> OutOfMemoryError at moe/activation_test.py:95 (x_clamp; tried to allocate 112.00 MiB)
2025-05-07T20:32:36.1614568Z Trying example: test_silu_mul_quant(T=16384, D=7168, scale_ub=None, contiguous=False, compiled=False) -> OutOfMemoryError at moe/activation_test.py:92 (torch.randn; tried to allocate 448.00 MiB)
2025-05-07T20:32:36.2857767Z Trying example: test_silu_mul_quant(T=2048, D=7168, scale_ub=1200.0, contiguous=True, compiled=True) -> OutOfMemoryError at moe/activation_test.py:95 (x_clamp; tried to allocate 56.00 MiB)
2025-05-07T20:32:36.2871948Z Trying example: test_silu_mul_quant(T=2048, D=7168, scale_ub=None, contiguous=True, compiled=False) -> OutOfMemoryError at moe/activation_test.py:94 (x_sign; tried to allocate 56.00 MiB)
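Note: the reported allocation sizes line up exactly with the tensor shapes in the test: a [T, 2*D] bfloat16 tensor costs T * 2D * 2 bytes, and sign/abs/clamp each allocate another tensor of the same size. A quick check of the figures above:

    >>> 16384 * (2 * 7168) * 2 / 2**20   # x = torch.randn([T, 2 * D]), T=16384, D=7168
    448.0
    >>> 16384 * (2 * 5120) * 2 / 2**20   # T=16384, D=5120 (the x_clamp failure above)
    320.0
    >>> 2048 * (2 * 7168) * 2 / 2**20    # T=2048, D=7168 temporaries
    56.0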
See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables) 2025-05-07T20:32:36.2884365Z 2025-05-07T20:32:36.2884488Z moe/activation_test.py:94: OutOfMemoryError 2025-05-07T20:32:36.2884707Z 2025-05-07T20:32:36.2884813Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:36.2885264Z self=, 2025-05-07T20:32:36.2885669Z T=1, 2025-05-07T20:32:36.2885852Z D=7168, 2025-05-07T20:32:36.2886051Z scale_ub=1200.0, 2025-05-07T20:32:36.2886278Z contiguous=True, 2025-05-07T20:32:36.2886499Z compiled=False, 2025-05-07T20:32:36.2886710Z ) 2025-05-07T20:32:36.2887029Z self = 2025-05-07T20:32:36.2887509Z T = 1, D = 7168, scale_ub = 1200.0, contiguous = True, compiled = False 2025-05-07T20:32:36.2887779Z 2025-05-07T20:32:36.2887860Z @given( 2025-05-07T20:32:36.2888097Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:36.2888409Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:36.2888764Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:36.2889102Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:36.2889428Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:36.2889727Z ) 2025-05-07T20:32:36.2890078Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:36.2890523Z def test_silu_mul_quant( 2025-05-07T20:32:36.2890767Z self, 2025-05-07T20:32:36.2890968Z T: int, 2025-05-07T20:32:36.2891172Z D: int, 2025-05-07T20:32:36.2891392Z scale_ub: Optional[float], 2025-05-07T20:32:36.2891667Z contiguous: bool, 2025-05-07T20:32:36.2891916Z compiled: bool, 2025-05-07T20:32:36.2892138Z ) -> None: 2025-05-07T20:32:36.2892359Z torch.manual_seed(2025) 2025-05-07T20:32:36.2892610Z 2025-05-07T20:32:36.2892880Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:36.2893230Z 2025-05-07T20:32:36.2893436Z x_sign = torch.sign(x) 2025-05-07T20:32:36.2893917Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:36.2894237Z x = x_sign * x_clamp 2025-05-07T20:32:36.2894486Z x0 = x[:, :D] 2025-05-07T20:32:36.2894705Z x1 = x[:, D:] 2025-05-07T20:32:36.2894922Z 2025-05-07T20:32:36.2895120Z if contiguous: 2025-05-07T20:32:36.2895360Z x0 = x0.contiguous() 2025-05-07T20:32:36.2895621Z x1 = x1.contiguous() 2025-05-07T20:32:36.2895867Z 2025-05-07T20:32:36.2896069Z if scale_ub is not None: 2025-05-07T20:32:36.2896344Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:36.2896687Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:36.2897004Z ) 2025-05-07T20:32:36.2897204Z else: 2025-05-07T20:32:36.2897427Z scale_ub_tensor = None 2025-05-07T20:32:36.2897685Z 2025-05-07T20:32:36.2897923Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:36.2898519Z op = silu_mul_quant 2025-05-07T20:32:36.2898779Z if compiled: 2025-05-07T20:32:36.2899026Z op = torch.compile(op) 2025-05-07T20:32:36.2899328Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:36.2899612Z 2025-05-07T20:32:36.2899805Z > y_fp8, y_scale = fn() 2025-05-07T20:32:36.2899978Z 2025-05-07T20:32:36.2900083Z moe/activation_test.py:117: 2025-05-07T20:32:36.2900383Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:36.2900719Z moe/activation_test.py:115: in fn 2025-05-07T20:32:36.2901000Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:36.2901761Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:32:36.2902449Z 
_fbgemm_silu_mul_quant[grid]( 2025-05-07T20:32:36.2902986Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:36.2903669Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:36.2904333Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:36.2904958Z kernel = self.compile( 2025-05-07T20:32:36.2905498Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:36.2906156Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:36.2906559Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:36.2906790Z 2025-05-07T20:32:36.2907002Z self = 2025-05-07T20:32:36.2908129Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:36.2909483Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f779d7b22a0>} 2025-05-07T20:32:36.2910815Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:36.2911822Z context = 2025-05-07T20:32:36.2912106Z 2025-05-07T20:32:36.2912271Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:36.2912794Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:36.2913263Z module_map=module_map) 2025-05-07T20:32:36.2913639Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:36.2914056Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:32:36.2914322Z E ^ 2025-05-07T20:32:36.2914784Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:36.2915227Z 2025-05-07T20:32:36.2915659Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:36.2916205Z 2025-05-07T20:32:36.2916310Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:36.2916723Z self=, 2025-05-07T20:32:36.2917143Z T=128, 2025-05-07T20:32:36.2917338Z D=5120, 2025-05-07T20:32:36.2917537Z scale_ub=None, 2025-05-07T20:32:36.2917752Z contiguous=True, 2025-05-07T20:32:36.2917981Z compiled=False, 2025-05-07T20:32:36.2918191Z ) 2025-05-07T20:32:36.2918513Z self = 2025-05-07T20:32:36.2919007Z T = 128, D = 5120, scale_ub = None, contiguous = True, compiled = False 2025-05-07T20:32:36.2919276Z 2025-05-07T20:32:36.2919362Z @given( 2025-05-07T20:32:36.2919592Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:36.2919917Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:36.2920231Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:36.2920567Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:36.2920896Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:36.2921188Z ) 2025-05-07T20:32:36.2921542Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:36.2922029Z def test_silu_mul_quant( 2025-05-07T20:32:36.2922277Z self, 2025-05-07T20:32:36.2922478Z T: int, 2025-05-07T20:32:36.2922672Z D: int, 2025-05-07T20:32:36.2922890Z scale_ub: Optional[float], 2025-05-07T20:32:36.2923167Z contiguous: bool, 2025-05-07T20:32:36.2923404Z compiled: bool, 2025-05-07T20:32:36.2923632Z ) -> None: 2025-05-07T20:32:36.2923850Z torch.manual_seed(2025) 2025-05-07T20:32:36.2924090Z 2025-05-07T20:32:36.2924408Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:36.2924754Z 2025-05-07T20:32:36.2924945Z x_sign = torch.sign(x) 2025-05-07T20:32:36.2925236Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:36.2925551Z x = x_sign * x_clamp 2025-05-07T20:32:36.2925797Z x0 = x[:, :D] 2025-05-07T20:32:36.2926012Z x1 = x[:, D:] 2025-05-07T20:32:36.2926226Z 2025-05-07T20:32:36.2926421Z if contiguous: 2025-05-07T20:32:36.2926649Z x0 = x0.contiguous() 2025-05-07T20:32:36.2926914Z x1 = x1.contiguous() 2025-05-07T20:32:36.2927156Z 2025-05-07T20:32:36.2927348Z if scale_ub is not None: 2025-05-07T20:32:36.2927626Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:36.2928018Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:36.2928328Z ) 2025-05-07T20:32:36.2928527Z else: 2025-05-07T20:32:36.2928746Z scale_ub_tensor = None 2025-05-07T20:32:36.2928999Z 2025-05-07T20:32:36.2929235Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:36.2929553Z op = silu_mul_quant 2025-05-07T20:32:36.2929807Z if compiled: 2025-05-07T20:32:36.2930058Z op = torch.compile(op) 2025-05-07T20:32:36.2930361Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:36.2930640Z 2025-05-07T20:32:36.2930833Z > y_fp8, y_scale = fn() 2025-05-07T20:32:36.2931011Z 2025-05-07T20:32:36.2931113Z moe/activation_test.py:117: 2025-05-07T20:32:36.2931425Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:36.2931759Z moe/activation_test.py:115: in fn 2025-05-07T20:32:36.2932053Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:36.2932785Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:32:36.2933464Z 
_fbgemm_silu_mul_quant[grid]( 2025-05-07T20:32:36.2934076Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:36.2934758Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:36.2935421Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:36.2935953Z kernel = self.compile( 2025-05-07T20:32:36.2936501Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:36.2937152Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:36.2937562Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:36.2937790Z 2025-05-07T20:32:36.2938000Z self = 2025-05-07T20:32:36.2939067Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:36.2940420Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f779d7b31a0>} 2025-05-07T20:32:36.2941742Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:36.2942792Z context = 2025-05-07T20:32:36.2943081Z 2025-05-07T20:32:36.2943251Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:36.2943769Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:36.2944277Z module_map=module_map) 2025-05-07T20:32:36.2944638Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:36.2945000Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:32:36.2945288Z E ^ 2025-05-07T20:32:36.2945767Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:36.2946213Z 2025-05-07T20:32:36.2946623Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:36.4076086Z 2025-05-07T20:32:36.4076291Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:36.4076885Z self=, 2025-05-07T20:32:36.4077636Z T=128, 2025-05-07T20:32:36.4077902Z D=7168, 2025-05-07T20:32:36.4078170Z scale_ub=None, 2025-05-07T20:32:36.4078466Z contiguous=True, 2025-05-07T20:32:36.4078702Z compiled=False, 2025-05-07T20:32:36.4078911Z ) 2025-05-07T20:32:36.4079229Z self = 2025-05-07T20:32:36.4079715Z T = 128, D = 7168, scale_ub = None, contiguous = True, compiled = False 2025-05-07T20:32:36.4079978Z 2025-05-07T20:32:36.4080064Z @given( 2025-05-07T20:32:36.4080290Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:36.4080611Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:36.4080925Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:36.4081253Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:36.4081587Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:36.4081881Z ) 2025-05-07T20:32:36.4082270Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:36.4082809Z def test_silu_mul_quant( 2025-05-07T20:32:36.4083053Z self, 2025-05-07T20:32:36.4083254Z T: int, 2025-05-07T20:32:36.4083456Z D: int, 2025-05-07T20:32:36.4083674Z scale_ub: Optional[float], 2025-05-07T20:32:36.4083952Z contiguous: bool, 2025-05-07T20:32:36.4084194Z compiled: bool, 2025-05-07T20:32:36.4084419Z ) -> None: 2025-05-07T20:32:36.4084639Z torch.manual_seed(2025) 2025-05-07T20:32:36.4084886Z 2025-05-07T20:32:36.4085153Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:36.4085498Z 2025-05-07T20:32:36.4085701Z x_sign = torch.sign(x) 2025-05-07T20:32:36.4085992Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:36.4086296Z x = x_sign * x_clamp 2025-05-07T20:32:36.4086536Z x0 = x[:, :D] 2025-05-07T20:32:36.4086758Z x1 = x[:, D:] 2025-05-07T20:32:36.4086963Z 2025-05-07T20:32:36.4087158Z if contiguous: 2025-05-07T20:32:36.4087397Z x0 = x0.contiguous() 2025-05-07T20:32:36.4087653Z x1 = x1.contiguous() 2025-05-07T20:32:36.4087897Z 2025-05-07T20:32:36.4088096Z if scale_ub is not None: 2025-05-07T20:32:36.4088366Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:36.4088701Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:36.4089015Z ) 2025-05-07T20:32:36.4089209Z else: 2025-05-07T20:32:36.4089426Z scale_ub_tensor = None 2025-05-07T20:32:36.4089686Z 2025-05-07T20:32:36.4089919Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:36.4090329Z op = silu_mul_quant 2025-05-07T20:32:36.4090586Z if compiled: 2025-05-07T20:32:36.4090836Z op = torch.compile(op) 2025-05-07T20:32:36.4091130Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:36.4091409Z 2025-05-07T20:32:36.4091612Z > y_fp8, y_scale = fn() 2025-05-07T20:32:36.4091775Z 2025-05-07T20:32:36.4091877Z moe/activation_test.py:117: 2025-05-07T20:32:36.4092176Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:36.4092598Z moe/activation_test.py:115: in fn 2025-05-07T20:32:36.4092879Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:36.4093566Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:32:36.4094388Z 
_fbgemm_silu_mul_quant[grid]( 2025-05-07T20:32:36.4094929Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:36.4095603Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:36.4096269Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:36.4096899Z kernel = self.compile( 2025-05-07T20:32:36.4097440Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:36.4098100Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:36.4098782Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:36.4099012Z 2025-05-07T20:32:36.4099223Z self = 2025-05-07T20:32:36.4100290Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:36.4101650Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f779ced4040>} 2025-05-07T20:32:36.4103098Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:36.4104111Z context = 2025-05-07T20:32:36.4104397Z 2025-05-07T20:32:36.4104574Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:36.4105089Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:36.4105610Z module_map=module_map) 2025-05-07T20:32:36.4105980Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:36.4106333Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:32:36.4106596Z E ^ 2025-05-07T20:32:36.4107058Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:36.4107501Z 2025-05-07T20:32:36.4107928Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:36.4108436Z 2025-05-07T20:32:36.4108539Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:36.4108957Z self=, 2025-05-07T20:32:36.4109359Z T=2048, 2025-05-07T20:32:36.4109548Z D=7168, 2025-05-07T20:32:36.4109748Z scale_ub=1200.0, 2025-05-07T20:32:36.4109977Z contiguous=True, 2025-05-07T20:32:36.4110198Z compiled=False, 2025-05-07T20:32:36.4110408Z ) 2025-05-07T20:32:36.4110730Z self = 2025-05-07T20:32:36.4111325Z T = 2048, D = 7168, scale_ub = 1200.0, contiguous = True, compiled = False 2025-05-07T20:32:36.4111597Z 2025-05-07T20:32:36.4111679Z @given( 2025-05-07T20:32:36.4111913Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:36.4112231Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:36.4112539Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:36.4112870Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:36.4113200Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:36.4113548Z ) 2025-05-07T20:32:36.4113895Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:36.4114336Z def test_silu_mul_quant( 2025-05-07T20:32:36.4114581Z self, 2025-05-07T20:32:36.4114775Z T: int, 2025-05-07T20:32:36.4114980Z D: int, 2025-05-07T20:32:36.4115201Z scale_ub: Optional[float], 2025-05-07T20:32:36.4115467Z contiguous: bool, 2025-05-07T20:32:36.4115743Z compiled: bool, 2025-05-07T20:32:36.4115992Z ) -> None: 2025-05-07T20:32:36.4116205Z torch.manual_seed(2025) 2025-05-07T20:32:36.4116464Z 2025-05-07T20:32:36.4116743Z > x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:36.4118820Z E torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 56.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 26.44 MiB is free. Including non-PyTorch memory, this process has 22.04 GiB memory in use. Of the allocated memory 21.69 GiB is allocated by PyTorch, and 59.18 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. 
See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables) 2025-05-07T20:32:36.4120637Z 2025-05-07T20:32:36.4120757Z moe/activation_test.py:92: OutOfMemoryError 2025-05-07T20:32:36.4120977Z 2025-05-07T20:32:36.4121081Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:36.4121492Z self=, 2025-05-07T20:32:36.4121899Z T=1, 2025-05-07T20:32:36.4130649Z D=5120, 2025-05-07T20:32:36.4130887Z scale_ub=1200.0, 2025-05-07T20:32:36.4131125Z contiguous=True, 2025-05-07T20:32:36.4131443Z compiled=False, 2025-05-07T20:32:36.4131661Z ) 2025-05-07T20:32:36.4131980Z self = 2025-05-07T20:32:36.4132479Z T = 1, D = 5120, scale_ub = 1200.0, contiguous = True, compiled = False 2025-05-07T20:32:36.4132750Z 2025-05-07T20:32:36.4132832Z @given( 2025-05-07T20:32:36.4133075Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:36.4133389Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:36.4133789Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:36.4134124Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:36.4134450Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:36.4134743Z ) 2025-05-07T20:32:36.4135097Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:36.4135537Z def test_silu_mul_quant( 2025-05-07T20:32:36.4135785Z self, 2025-05-07T20:32:36.4135983Z T: int, 2025-05-07T20:32:36.4136180Z D: int, 2025-05-07T20:32:36.4136403Z scale_ub: Optional[float], 2025-05-07T20:32:36.4136678Z contiguous: bool, 2025-05-07T20:32:36.4136923Z compiled: bool, 2025-05-07T20:32:36.4137144Z ) -> None: 2025-05-07T20:32:36.4137366Z torch.manual_seed(2025) 2025-05-07T20:32:36.4137617Z 2025-05-07T20:32:36.4137888Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:36.4138241Z 2025-05-07T20:32:36.4138444Z x_sign = torch.sign(x) 2025-05-07T20:32:36.4138740Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:36.4139114Z x = x_sign * x_clamp 2025-05-07T20:32:36.4139357Z x0 = x[:, :D] 2025-05-07T20:32:36.4139574Z x1 = x[:, D:] 2025-05-07T20:32:36.4139793Z 2025-05-07T20:32:36.4139987Z if contiguous: 2025-05-07T20:32:36.4140214Z x0 = x0.contiguous() 2025-05-07T20:32:36.4140482Z x1 = x1.contiguous() 2025-05-07T20:32:36.4140723Z 2025-05-07T20:32:36.4140923Z if scale_ub is not None: 2025-05-07T20:32:36.4141200Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:36.4141588Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:36.4141900Z ) 2025-05-07T20:32:36.4142099Z else: 2025-05-07T20:32:36.4142315Z scale_ub_tensor = None 2025-05-07T20:32:36.4142566Z 2025-05-07T20:32:36.4142802Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:36.4143124Z op = silu_mul_quant 2025-05-07T20:32:36.4143370Z if compiled: 2025-05-07T20:32:36.4143629Z op = torch.compile(op) 2025-05-07T20:32:36.4143929Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:36.4144201Z 2025-05-07T20:32:36.4144403Z > y_fp8, y_scale = fn() 2025-05-07T20:32:36.4144565Z 2025-05-07T20:32:36.4144674Z moe/activation_test.py:117: 2025-05-07T20:32:36.4145011Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:36.4145345Z moe/activation_test.py:115: in fn 2025-05-07T20:32:36.4145630Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:36.4146329Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:32:36.4147008Z 
_fbgemm_silu_mul_quant[grid]( 2025-05-07T20:32:36.4147545Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:36.4148221Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:36.4148879Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:36.4149414Z kernel = self.compile( 2025-05-07T20:32:36.4149958Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:36.4150657Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:36.4151057Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:36.4151298Z 2025-05-07T20:32:36.4151504Z self = 2025-05-07T20:32:36.4152575Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:36.4153929Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f779ced5580>} 2025-05-07T20:32:36.4155259Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:36.4156270Z context = 2025-05-07T20:32:36.4156563Z 2025-05-07T20:32:36.4156731Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:36.4157247Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:36.4157708Z module_map=module_map) 2025-05-07T20:32:36.4158073Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:36.4158428Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:32:36.4158693Z E ^ 2025-05-07T20:32:36.4159191Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:36.4159638Z 2025-05-07T20:32:36.4160047Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:36.4976696Z 2025-05-07T20:32:36.4976966Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:36.4977587Z self=, 2025-05-07T20:32:36.4978385Z T=2048, 2025-05-07T20:32:36.4978649Z D=5120, 2025-05-07T20:32:36.4978913Z scale_ub=None, 2025-05-07T20:32:36.4979210Z contiguous=True, 2025-05-07T20:32:36.4979528Z compiled=False, 2025-05-07T20:32:36.4979744Z ) 2025-05-07T20:32:36.4980075Z self = 2025-05-07T20:32:36.4980579Z T = 2048, D = 5120, scale_ub = None, contiguous = True, compiled = False 2025-05-07T20:32:36.4980852Z 2025-05-07T20:32:36.4980940Z @given( 2025-05-07T20:32:36.4981186Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:36.4981515Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:36.4981826Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:36.4982259Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:36.4982603Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:36.4982896Z ) 2025-05-07T20:32:36.4983281Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:36.4983724Z def test_silu_mul_quant( 2025-05-07T20:32:36.4983975Z self, 2025-05-07T20:32:36.4984180Z T: int, 2025-05-07T20:32:36.4984388Z D: int, 2025-05-07T20:32:36.4984607Z scale_ub: Optional[float], 2025-05-07T20:32:36.4984887Z contiguous: bool, 2025-05-07T20:32:36.4985137Z compiled: bool, 2025-05-07T20:32:36.4985365Z ) -> None: 2025-05-07T20:32:36.4985589Z torch.manual_seed(2025) 2025-05-07T20:32:36.4985840Z 2025-05-07T20:32:36.4986113Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:36.4986466Z 2025-05-07T20:32:36.4986667Z > x_sign = torch.sign(x) 2025-05-07T20:32:36.4988662Z E torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 40.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 26.44 MiB is free. Including non-PyTorch memory, this process has 22.04 GiB memory in use. Of the allocated memory 21.73 GiB is allocated by PyTorch, and 19.12 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. 
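The CompilationError above ("type fp8e4nv not supported in this architecture") is raised while Triton lowers _fbgemm_silu_mul_quant for this runner's GPU, so every example that reaches the kernel fails the same way regardless of parameters. A hedged sketch of a capability guard that skips such tests up front; the (8, 9) threshold is an assumption (fp8e4nv is the e4m3 format used on Ada/Hopper-class parts), and Fp8KernelTests is a hypothetical class name for illustration:

import unittest
import torch

def supports_fp8e4nv() -> bool:
    # Assumption: Triton's fp8e4nv lowering needs compute capability >= 8.9;
    # the A10G on a linux.g5 runner reports (8, 6) and only offers
    # fp8e4b15/fp8e5, matching the error message in this log.
    if not torch.cuda.is_available():
        return False
    return torch.cuda.get_device_capability() >= (8, 9)

@unittest.skipUnless(supports_fp8e4nv(), "fp8e4nv requires compute capability >= 8.9")
class Fp8KernelTests(unittest.TestCase):
    ...  # fp8e4nv-dependent cases would live here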
See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables) 2025-05-07T20:32:36.4990491Z 2025-05-07T20:32:36.4990612Z moe/activation_test.py:94: OutOfMemoryError 2025-05-07T20:32:36.4990837Z 2025-05-07T20:32:36.4990948Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:36.4991366Z self=, 2025-05-07T20:32:36.4991774Z T=16384, 2025-05-07T20:32:36.4991968Z D=5120, 2025-05-07T20:32:36.4992168Z scale_ub=None, 2025-05-07T20:32:36.4992394Z contiguous=True, 2025-05-07T20:32:36.4992618Z compiled=False, 2025-05-07T20:32:36.4992837Z ) 2025-05-07T20:32:36.4993159Z self = 2025-05-07T20:32:36.4993648Z T = 16384, D = 5120, scale_ub = None, contiguous = True, compiled = False 2025-05-07T20:32:36.4993930Z 2025-05-07T20:32:36.4994012Z @given( 2025-05-07T20:32:36.4994248Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:36.4994565Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:36.4994870Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:36.4995218Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:36.4995674Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:36.4995961Z ) 2025-05-07T20:32:36.4996314Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:36.4996759Z def test_silu_mul_quant( 2025-05-07T20:32:36.4996999Z self, 2025-05-07T20:32:36.4997205Z T: int, 2025-05-07T20:32:36.4997410Z D: int, 2025-05-07T20:32:36.4997632Z scale_ub: Optional[float], 2025-05-07T20:32:36.4997912Z contiguous: bool, 2025-05-07T20:32:36.4998443Z compiled: bool, 2025-05-07T20:32:36.4998748Z ) -> None: 2025-05-07T20:32:36.4998973Z torch.manual_seed(2025) 2025-05-07T20:32:36.4999225Z 2025-05-07T20:32:36.4999495Z > x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:36.5001554Z E torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 320.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 26.44 MiB is free. Including non-PyTorch memory, this process has 22.04 GiB memory in use. Of the allocated memory 21.73 GiB is allocated by PyTorch, and 19.12 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. 
See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables) 2025-05-07T20:32:36.5003379Z 2025-05-07T20:32:36.5003499Z moe/activation_test.py:92: OutOfMemoryError 2025-05-07T20:32:36.5003717Z 2025-05-07T20:32:36.5003821Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:36.5004244Z self=, 2025-05-07T20:32:36.5004642Z T=4096, 2025-05-07T20:32:36.5004837Z D=5120, 2025-05-07T20:32:36.5005037Z scale_ub=None, 2025-05-07T20:32:36.5005254Z contiguous=True, 2025-05-07T20:32:36.5005485Z compiled=False, 2025-05-07T20:32:36.5005699Z ) 2025-05-07T20:32:36.5006019Z self = 2025-05-07T20:32:36.5006506Z T = 4096, D = 5120, scale_ub = None, contiguous = True, compiled = False 2025-05-07T20:32:36.5006781Z 2025-05-07T20:32:36.5006861Z @given( 2025-05-07T20:32:36.5007097Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:36.5007415Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:36.5007815Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:36.5008152Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:36.5008482Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:36.5008769Z ) 2025-05-07T20:32:36.5009121Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:36.5009569Z def test_silu_mul_quant( 2025-05-07T20:32:36.5009808Z self, 2025-05-07T20:32:36.5010007Z T: int, 2025-05-07T20:32:36.5010213Z D: int, 2025-05-07T20:32:36.5010431Z scale_ub: Optional[float], 2025-05-07T20:32:36.5010711Z contiguous: bool, 2025-05-07T20:32:36.5010962Z compiled: bool, 2025-05-07T20:32:36.5011191Z ) -> None: 2025-05-07T20:32:36.5011411Z torch.manual_seed(2025) 2025-05-07T20:32:36.5011657Z 2025-05-07T20:32:36.5011930Z > x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:36.5014034Z E torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 80.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 26.44 MiB is free. Including non-PyTorch memory, this process has 22.04 GiB memory in use. Of the allocated memory 21.73 GiB is allocated by PyTorch, and 19.12 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. 
See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables) 2025-05-07T20:32:36.5015846Z 2025-05-07T20:32:36.5015965Z moe/activation_test.py:92: OutOfMemoryError 2025-05-07T20:32:36.5016244Z 2025-05-07T20:32:36.5016357Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:36.5016763Z self=, 2025-05-07T20:32:36.5017169Z T=2048, 2025-05-07T20:32:36.5017365Z D=5120, 2025-05-07T20:32:36.5017571Z scale_ub=None, 2025-05-07T20:32:36.5017787Z contiguous=False, 2025-05-07T20:32:36.5018020Z compiled=False, 2025-05-07T20:32:36.5018235Z ) 2025-05-07T20:32:36.5018555Z self = 2025-05-07T20:32:36.5019093Z T = 2048, D = 5120, scale_ub = None, contiguous = False, compiled = False 2025-05-07T20:32:36.5019363Z 2025-05-07T20:32:36.5019452Z @given( 2025-05-07T20:32:36.5019680Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:36.5019994Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:36.5020301Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:36.5020630Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:36.5020963Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:36.5021252Z ) 2025-05-07T20:32:36.5021601Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:36.5022083Z def test_silu_mul_quant( 2025-05-07T20:32:36.5022328Z self, 2025-05-07T20:32:36.5022531Z T: int, 2025-05-07T20:32:36.5022727Z D: int, 2025-05-07T20:32:36.5022950Z scale_ub: Optional[float], 2025-05-07T20:32:36.5023233Z contiguous: bool, 2025-05-07T20:32:36.5023472Z compiled: bool, 2025-05-07T20:32:36.5023702Z ) -> None: 2025-05-07T20:32:36.5023920Z torch.manual_seed(2025) 2025-05-07T20:32:36.5024160Z 2025-05-07T20:32:36.5024437Z > x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:36.5026472Z E torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 40.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 26.44 MiB is free. Including non-PyTorch memory, this process has 22.04 GiB memory in use. Of the allocated memory 21.73 GiB is allocated by PyTorch, and 19.12 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. 
See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables) 2025-05-07T20:32:36.5028283Z 2025-05-07T20:32:36.5028402Z moe/activation_test.py:92: OutOfMemoryError 2025-05-07T20:32:36.5028619Z 2025-05-07T20:32:36.5028730Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:36.5029133Z self=, 2025-05-07T20:32:36.5029536Z T=4096, 2025-05-07T20:32:36.5029729Z D=7168, 2025-05-07T20:32:36.5029919Z scale_ub=None, 2025-05-07T20:32:36.5030140Z contiguous=True, 2025-05-07T20:32:36.5030367Z compiled=True, 2025-05-07T20:32:36.5030572Z ) 2025-05-07T20:32:36.5030890Z self = 2025-05-07T20:32:36.5031381Z T = 4096, D = 7168, scale_ub = None, contiguous = True, compiled = True 2025-05-07T20:32:36.5031652Z 2025-05-07T20:32:36.5031741Z @given( 2025-05-07T20:32:36.5031975Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:36.5032301Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:36.5032613Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:36.5032941Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:36.5033276Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:36.5033569Z ) 2025-05-07T20:32:36.5033913Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:36.5034357Z def test_silu_mul_quant( 2025-05-07T20:32:36.5034603Z self, 2025-05-07T20:32:36.5034795Z T: int, 2025-05-07T20:32:36.5034996Z D: int, 2025-05-07T20:32:36.5035272Z scale_ub: Optional[float], 2025-05-07T20:32:36.5035544Z contiguous: bool, 2025-05-07T20:32:36.5035786Z compiled: bool, 2025-05-07T20:32:36.5036014Z ) -> None: 2025-05-07T20:32:36.5036233Z torch.manual_seed(2025) 2025-05-07T20:32:36.5036476Z 2025-05-07T20:32:36.5036756Z > x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:36.5038752Z E torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 112.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 26.44 MiB is free. Including non-PyTorch memory, this process has 22.04 GiB memory in use. Of the allocated memory 21.73 GiB is allocated by PyTorch, and 19.12 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. 
See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables) 2025-05-07T20:32:36.5040601Z 2025-05-07T20:32:36.5040726Z moe/activation_test.py:92: OutOfMemoryError 2025-05-07T20:32:36.5040936Z 2025-05-07T20:32:36.5041041Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:36.5041451Z self=, 2025-05-07T20:32:36.5041892Z T=2048, 2025-05-07T20:32:36.5042089Z D=5120, 2025-05-07T20:32:36.5042290Z scale_ub=1200.0, 2025-05-07T20:32:36.5042518Z contiguous=False, 2025-05-07T20:32:36.5042746Z compiled=False, 2025-05-07T20:32:36.5583284Z ) 2025-05-07T20:32:36.5583803Z self = 2025-05-07T20:32:36.5584490Z T = 2048, D = 5120, scale_ub = 1200.0, contiguous = False, compiled = False 2025-05-07T20:32:36.5584857Z 2025-05-07T20:32:36.5584977Z @given( 2025-05-07T20:32:36.5585286Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:36.5585629Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:36.5585961Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:36.5586287Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:36.5586616Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:36.5586903Z ) 2025-05-07T20:32:36.5587253Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:36.5587872Z def test_silu_mul_quant( 2025-05-07T20:32:36.5588120Z self, 2025-05-07T20:32:36.5588314Z T: int, 2025-05-07T20:32:36.5588516Z D: int, 2025-05-07T20:32:36.5588742Z scale_ub: Optional[float], 2025-05-07T20:32:36.5589011Z contiguous: bool, 2025-05-07T20:32:36.5589258Z compiled: bool, 2025-05-07T20:32:36.5589491Z ) -> None: 2025-05-07T20:32:36.5589713Z torch.manual_seed(2025) 2025-05-07T20:32:36.5589955Z 2025-05-07T20:32:36.5590230Z > x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:36.5592231Z E torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 40.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 26.44 MiB is free. Including non-PyTorch memory, this process has 22.04 GiB memory in use. Of the allocated memory 21.73 GiB is allocated by PyTorch, and 19.12 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. 
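Note the pattern across the OOM reports above: each new example fails to allocate only tens of MiB while PyTorch already holds about 21.7 GiB, which suggests allocations from earlier Hypothesis examples are still cached when the next one starts. A speculative sketch of a cleanup helper to run between examples; whether it recovers the memory depends on the earlier tensors actually being unreferenced by then:

import gc
import torch

def free_cuda_cache() -> None:
    gc.collect()              # drop lingering Python references first
    torch.cuda.synchronize()  # let in-flight kernels finish
    torch.cuda.empty_cache()  # return cached, unused blocks to the driver

# e.g. call at the top of the test body, before the large torch.randn:
free_cuda_cache()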
See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables) 2025-05-07T20:32:36.5594044Z 2025-05-07T20:32:36.5594170Z moe/activation_test.py:92: OutOfMemoryError 2025-05-07T20:32:36.5594381Z 2025-05-07T20:32:36.5594485Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:36.5594895Z self=, 2025-05-07T20:32:36.5595315Z T=4096, 2025-05-07T20:32:36.5595542Z D=7168, 2025-05-07T20:32:36.5595744Z scale_ub=1200.0, 2025-05-07T20:32:36.5595973Z contiguous=True, 2025-05-07T20:32:36.5596306Z compiled=False, 2025-05-07T20:32:36.5596517Z ) 2025-05-07T20:32:36.5596832Z self = 2025-05-07T20:32:36.5597323Z T = 4096, D = 7168, scale_ub = 1200.0, contiguous = True, compiled = False 2025-05-07T20:32:36.5597610Z 2025-05-07T20:32:36.5597689Z @given( 2025-05-07T20:32:36.5597928Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:36.5598538Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:36.5598926Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:36.5599258Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:36.5599582Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:36.5599872Z ) 2025-05-07T20:32:36.5600221Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:36.5600658Z def test_silu_mul_quant( 2025-05-07T20:32:36.5600903Z self, 2025-05-07T20:32:36.5601110Z T: int, 2025-05-07T20:32:36.5601306Z D: int, 2025-05-07T20:32:36.5601528Z scale_ub: Optional[float], 2025-05-07T20:32:36.5601803Z contiguous: bool, 2025-05-07T20:32:36.5602046Z compiled: bool, 2025-05-07T20:32:36.5602269Z ) -> None: 2025-05-07T20:32:36.5602564Z torch.manual_seed(2025) 2025-05-07T20:32:36.5602812Z 2025-05-07T20:32:36.5603080Z > x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:36.5605075Z E torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 112.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 26.44 MiB is free. Including non-PyTorch memory, this process has 22.04 GiB memory in use. Of the allocated memory 21.73 GiB is allocated by PyTorch, and 19.12 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. 
See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables) 2025-05-07T20:32:36.5606888Z 2025-05-07T20:32:36.5607005Z moe/activation_test.py:92: OutOfMemoryError 2025-05-07T20:32:36.5607218Z 2025-05-07T20:32:36.5607327Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:36.5607741Z self=, 2025-05-07T20:32:36.5608197Z T=16384, 2025-05-07T20:32:36.5608398Z D=7168, 2025-05-07T20:32:36.5608597Z scale_ub=None, 2025-05-07T20:32:36.5608813Z contiguous=False, 2025-05-07T20:32:36.5609041Z compiled=True, 2025-05-07T20:32:36.5609247Z ) 2025-05-07T20:32:36.5609559Z self = 2025-05-07T20:32:36.5610054Z T = 16384, D = 7168, scale_ub = None, contiguous = False, compiled = True 2025-05-07T20:32:36.5610328Z 2025-05-07T20:32:36.5610417Z @given( 2025-05-07T20:32:36.5610645Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:36.5610965Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:36.5611278Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:36.5611616Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:36.5611944Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:36.5612240Z ) 2025-05-07T20:32:36.5612596Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:36.5613035Z def test_silu_mul_quant( 2025-05-07T20:32:36.5613284Z self, 2025-05-07T20:32:36.5613488Z T: int, 2025-05-07T20:32:36.5613766Z D: int, 2025-05-07T20:32:36.5613993Z scale_ub: Optional[float], 2025-05-07T20:32:36.5614270Z contiguous: bool, 2025-05-07T20:32:36.5614509Z compiled: bool, 2025-05-07T20:32:36.5614736Z ) -> None: 2025-05-07T20:32:36.5614955Z torch.manual_seed(2025) 2025-05-07T20:32:36.5615197Z 2025-05-07T20:32:36.5615473Z > x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:36.5617545Z E torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 448.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 26.44 MiB is free. Including non-PyTorch memory, this process has 22.04 GiB memory in use. Of the allocated memory 21.73 GiB is allocated by PyTorch, and 19.12 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. 
See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables) 2025-05-07T20:32:36.5619391Z 2025-05-07T20:32:36.5619512Z moe/activation_test.py:92: OutOfMemoryError 2025-05-07T20:32:36.5619726Z 2025-05-07T20:32:36.5619843Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:36.5620249Z self=, 2025-05-07T20:32:36.5620654Z T=4096, 2025-05-07T20:32:36.5620848Z D=7168, 2025-05-07T20:32:36.5621042Z scale_ub=None, 2025-05-07T20:32:36.5621260Z contiguous=True, 2025-05-07T20:32:36.5621489Z compiled=False, 2025-05-07T20:32:36.5621693Z ) 2025-05-07T20:32:36.5622019Z self = 2025-05-07T20:32:36.5622580Z T = 4096, D = 7168, scale_ub = None, contiguous = True, compiled = False 2025-05-07T20:32:36.5622854Z 2025-05-07T20:32:36.5622942Z @given( 2025-05-07T20:32:36.5623170Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:36.5623495Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:36.5623808Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:36.5624137Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:36.5624471Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:36.5624764Z ) 2025-05-07T20:32:36.5625112Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:36.5625563Z def test_silu_mul_quant( 2025-05-07T20:32:36.5625815Z self, 2025-05-07T20:32:36.5626013Z T: int, 2025-05-07T20:32:36.5626220Z D: int, 2025-05-07T20:32:36.5626446Z scale_ub: Optional[float], 2025-05-07T20:32:36.5626725Z contiguous: bool, 2025-05-07T20:32:36.5626971Z compiled: bool, 2025-05-07T20:32:36.5627219Z ) -> None: 2025-05-07T20:32:36.5627490Z torch.manual_seed(2025) 2025-05-07T20:32:36.5627736Z 2025-05-07T20:32:36.5628018Z > x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:36.5630016Z E torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 112.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 26.44 MiB is free. Including non-PyTorch memory, this process has 22.04 GiB memory in use. Of the allocated memory 21.73 GiB is allocated by PyTorch, and 19.12 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. 
See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables) 2025-05-07T20:32:36.5631828Z 2025-05-07T20:32:36.5631957Z moe/activation_test.py:92: OutOfMemoryError 2025-05-07T20:32:36.5632169Z 2025-05-07T20:32:36.5632285Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:36.5632696Z self=, 2025-05-07T20:32:36.5633103Z T=16384, 2025-05-07T20:32:36.5633303Z D=7168, 2025-05-07T20:32:36.5633498Z scale_ub=None, 2025-05-07T20:32:36.5641906Z contiguous=True, 2025-05-07T20:32:36.5642153Z compiled=False, 2025-05-07T20:32:36.5642370Z ) 2025-05-07T20:32:36.5642698Z self = 2025-05-07T20:32:36.5643194Z T = 16384, D = 7168, scale_ub = None, contiguous = True, compiled = False 2025-05-07T20:32:36.5643481Z 2025-05-07T20:32:36.5643562Z @given( 2025-05-07T20:32:36.5643885Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:36.5644204Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:36.5644510Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:36.5644844Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:36.5645182Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:36.5645470Z ) 2025-05-07T20:32:36.5645825Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:36.5646270Z def test_silu_mul_quant( 2025-05-07T20:32:36.5646561Z self, 2025-05-07T20:32:36.5646766Z T: int, 2025-05-07T20:32:36.5646969Z D: int, 2025-05-07T20:32:36.5647186Z scale_ub: Optional[float], 2025-05-07T20:32:36.5647471Z contiguous: bool, 2025-05-07T20:32:36.5647721Z compiled: bool, 2025-05-07T20:32:36.5647952Z ) -> None: 2025-05-07T20:32:36.5648166Z torch.manual_seed(2025) 2025-05-07T20:32:36.5648410Z 2025-05-07T20:32:36.5648688Z > x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:36.5650743Z E torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 448.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 26.44 MiB is free. Including non-PyTorch memory, this process has 22.04 GiB memory in use. Of the allocated memory 21.73 GiB is allocated by PyTorch, and 19.12 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. 
See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables) 2025-05-07T20:32:36.5652570Z 2025-05-07T20:32:36.5652690Z moe/activation_test.py:92: OutOfMemoryError 2025-05-07T20:32:36.5652908Z 2025-05-07T20:32:36.5653014Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:36.5653426Z self=, 2025-05-07T20:32:36.5653963Z T=16384, 2025-05-07T20:32:36.5654157Z D=7168, 2025-05-07T20:32:36.5654356Z scale_ub=1200.0, 2025-05-07T20:32:36.5654588Z contiguous=True, 2025-05-07T20:32:36.5654809Z compiled=False, 2025-05-07T20:32:36.5655020Z ) 2025-05-07T20:32:36.5655344Z self = 2025-05-07T20:32:36.5655879Z T = 16384, D = 7168, scale_ub = 1200.0, contiguous = True, compiled = False 2025-05-07T20:32:36.5656153Z 2025-05-07T20:32:36.5656234Z @given( 2025-05-07T20:32:36.5656474Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:36.5656788Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:36.5657095Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:36.5657421Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:36.5657755Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:36.5658043Z ) 2025-05-07T20:32:36.5658389Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:36.5658836Z def test_silu_mul_quant( 2025-05-07T20:32:36.5659083Z self, 2025-05-07T20:32:36.5659279Z T: int, 2025-05-07T20:32:36.5659479Z D: int, 2025-05-07T20:32:36.5659700Z scale_ub: Optional[float], 2025-05-07T20:32:36.5659974Z contiguous: bool, 2025-05-07T20:32:36.5660221Z compiled: bool, 2025-05-07T20:32:36.5660447Z ) -> None: 2025-05-07T20:32:36.5660662Z torch.manual_seed(2025) 2025-05-07T20:32:36.5660905Z 2025-05-07T20:32:36.5661181Z > x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:36.5663174Z E torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 448.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 26.44 MiB is free. Including non-PyTorch memory, this process has 22.04 GiB memory in use. Of the allocated memory 21.73 GiB is allocated by PyTorch, and 19.12 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. 
See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables) 2025-05-07T20:32:36.5665027Z 2025-05-07T20:32:36.5665155Z moe/activation_test.py:92: OutOfMemoryError 2025-05-07T20:32:36.7464000Z 2025-05-07T20:32:36.7464449Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:36.7464899Z self=, 2025-05-07T20:32:36.7465904Z T=128, 2025-05-07T20:32:36.7466279Z D=5120, 2025-05-07T20:32:36.7466663Z scale_ub=1200.0, 2025-05-07T20:32:36.7467094Z contiguous=False, 2025-05-07T20:32:36.7467536Z compiled=False, 2025-05-07T20:32:36.7467940Z ) 2025-05-07T20:32:36.7468555Z self = 2025-05-07T20:32:36.7469531Z T = 128, D = 5120, scale_ub = 1200.0, contiguous = False, compiled = False 2025-05-07T20:32:36.7470089Z 2025-05-07T20:32:36.7470247Z @given( 2025-05-07T20:32:36.7470698Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:36.7471300Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:36.7471899Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:36.7472676Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:36.7473312Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:36.7473873Z ) 2025-05-07T20:32:36.7474560Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:36.7475279Z def test_silu_mul_quant( 2025-05-07T20:32:36.7475559Z self, 2025-05-07T20:32:36.7475759Z T: int, 2025-05-07T20:32:36.7475956Z D: int, 2025-05-07T20:32:36.7476173Z scale_ub: Optional[float], 2025-05-07T20:32:36.7476447Z contiguous: bool, 2025-05-07T20:32:36.7476686Z compiled: bool, 2025-05-07T20:32:36.7476907Z ) -> None: 2025-05-07T20:32:36.7477132Z torch.manual_seed(2025) 2025-05-07T20:32:36.7477376Z 2025-05-07T20:32:36.7477641Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:36.7477983Z 2025-05-07T20:32:36.7478185Z x_sign = torch.sign(x) 2025-05-07T20:32:36.7478478Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:36.7478865Z x = x_sign * x_clamp 2025-05-07T20:32:36.7479110Z x0 = x[:, :D] 2025-05-07T20:32:36.7479318Z x1 = x[:, D:] 2025-05-07T20:32:36.7479530Z 2025-05-07T20:32:36.7479718Z if contiguous: 2025-05-07T20:32:36.7479943Z x0 = x0.contiguous() 2025-05-07T20:32:36.7480200Z x1 = x1.contiguous() 2025-05-07T20:32:36.7480440Z 2025-05-07T20:32:36.7480626Z if scale_ub is not None: 2025-05-07T20:32:36.7480899Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:36.7481231Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:36.7481538Z ) 2025-05-07T20:32:36.7481732Z else: 2025-05-07T20:32:36.7481945Z scale_ub_tensor = None 2025-05-07T20:32:36.7482193Z 2025-05-07T20:32:36.7482417Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:36.7482731Z op = silu_mul_quant 2025-05-07T20:32:36.7482982Z if compiled: 2025-05-07T20:32:36.7483229Z op = torch.compile(op) 2025-05-07T20:32:36.7483526Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:36.7483831Z 2025-05-07T20:32:36.7484028Z > y_fp8, y_scale = fn() 2025-05-07T20:32:36.7484189Z 2025-05-07T20:32:36.7484297Z moe/activation_test.py:117: 2025-05-07T20:32:36.7484584Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:36.7484919Z moe/activation_test.py:115: in fn 2025-05-07T20:32:36.7485202Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:36.7485886Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:32:36.7486642Z 
_fbgemm_silu_mul_quant[grid]( 2025-05-07T20:32:36.7487175Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:36.7487850Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:36.7488506Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:36.7489038Z kernel = self.compile( 2025-05-07T20:32:36.7489631Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:36.7490280Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:36.7490667Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:36.7490898Z 2025-05-07T20:32:36.7491102Z self = 2025-05-07T20:32:36.7492168Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:36.7493561Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f779cfc11c0>} 2025-05-07T20:32:36.7494966Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:36.7495976Z context = 2025-05-07T20:32:36.7496264Z 2025-05-07T20:32:36.7496427Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:36.7496939Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:36.7497402Z module_map=module_map) 2025-05-07T20:32:36.7497772Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:36.7498124Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:32:36.7498543Z E ^ 2025-05-07T20:32:36.7499079Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:36.7499525Z 2025-05-07T20:32:36.7499939Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:36.7500442Z 2025-05-07T20:32:36.7500549Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:36.7500954Z self=, 2025-05-07T20:32:36.7501353Z T=2048, 2025-05-07T20:32:36.7501541Z D=7168, 2025-05-07T20:32:36.7501725Z scale_ub=None, 2025-05-07T20:32:36.7501944Z contiguous=False, 2025-05-07T20:32:36.7502167Z compiled=False, 2025-05-07T20:32:36.7502362Z ) 2025-05-07T20:32:36.7502673Z self = 2025-05-07T20:32:36.7503161Z T = 2048, D = 7168, scale_ub = None, contiguous = False, compiled = False 2025-05-07T20:32:36.7503428Z 2025-05-07T20:32:36.7503517Z @given( 2025-05-07T20:32:36.7503737Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:36.7504046Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:36.7504352Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:36.7504672Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:36.7504999Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:36.7505291Z ) 2025-05-07T20:32:36.7505629Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:36.7506069Z def test_silu_mul_quant( 2025-05-07T20:32:36.7506379Z self, 2025-05-07T20:32:36.7506574Z T: int, 2025-05-07T20:32:36.7506765Z D: int, 2025-05-07T20:32:36.7506983Z scale_ub: Optional[float], 2025-05-07T20:32:36.7507252Z contiguous: bool, 2025-05-07T20:32:36.7507480Z compiled: bool, 2025-05-07T20:32:36.7507702Z ) -> None: 2025-05-07T20:32:36.7507914Z torch.manual_seed(2025) 2025-05-07T20:32:36.7508153Z 2025-05-07T20:32:36.7508421Z > x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:36.7510520Z E torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 56.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 26.44 MiB is free. Including non-PyTorch memory, this process has 22.04 GiB memory in use. Of the allocated memory 21.74 GiB is allocated by PyTorch, and 10.99 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. 
See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables) 2025-05-07T20:32:36.7512327Z 2025-05-07T20:32:36.7512449Z moe/activation_test.py:92: OutOfMemoryError 2025-05-07T20:32:36.7512657Z 2025-05-07T20:32:36.7512765Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:36.7513221Z self=, 2025-05-07T20:32:36.7513615Z T=128, 2025-05-07T20:32:36.7513803Z D=7168, 2025-05-07T20:32:36.7513991Z scale_ub=1200.0, 2025-05-07T20:32:36.7514215Z contiguous=True, 2025-05-07T20:32:36.7514433Z compiled=True, 2025-05-07T20:32:36.7514630Z ) 2025-05-07T20:32:36.7514952Z self = 2025-05-07T20:32:36.7515432Z T = 128, D = 7168, scale_ub = 1200.0, contiguous = True, compiled = True 2025-05-07T20:32:36.7515701Z 2025-05-07T20:32:36.7515774Z @given( 2025-05-07T20:32:36.7516004Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:36.7516313Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:36.7516615Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:36.7516945Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:36.7517266Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:36.7517553Z ) 2025-05-07T20:32:36.7517939Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:36.7518380Z def test_silu_mul_quant( 2025-05-07T20:32:36.7518619Z self, 2025-05-07T20:32:36.7518816Z T: int, 2025-05-07T20:32:36.7519013Z D: int, 2025-05-07T20:32:36.7519225Z scale_ub: Optional[float], 2025-05-07T20:32:36.7519495Z contiguous: bool, 2025-05-07T20:32:36.7519739Z compiled: bool, 2025-05-07T20:32:36.7519954Z ) -> None: 2025-05-07T20:32:36.7520166Z torch.manual_seed(2025) 2025-05-07T20:32:36.7520405Z 2025-05-07T20:32:36.7520660Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:36.7521000Z 2025-05-07T20:32:36.7521190Z x_sign = torch.sign(x) 2025-05-07T20:32:36.7521470Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:36.7521777Z x = x_sign * x_clamp 2025-05-07T20:32:36.7522015Z x0 = x[:, :D] 2025-05-07T20:32:36.7522224Z x1 = x[:, D:] 2025-05-07T20:32:36.7522435Z 2025-05-07T20:32:36.7522618Z if contiguous: 2025-05-07T20:32:36.7522844Z x0 = x0.contiguous() 2025-05-07T20:32:36.7523099Z x1 = x1.contiguous() 2025-05-07T20:32:36.7523339Z 2025-05-07T20:32:36.7523528Z if scale_ub is not None: 2025-05-07T20:32:36.7523794Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:36.7524124Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:36.7524432Z ) 2025-05-07T20:32:36.7524622Z else: 2025-05-07T20:32:36.7524828Z scale_ub_tensor = None 2025-05-07T20:32:36.7525126Z 2025-05-07T20:32:36.7525375Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:36.7525712Z op = silu_mul_quant 2025-05-07T20:32:36.7525958Z if compiled: 2025-05-07T20:32:36.7526198Z op = torch.compile(op) 2025-05-07T20:32:36.7526498Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:36.7526773Z 2025-05-07T20:32:36.7526961Z > y_fp8, y_scale = fn() 2025-05-07T20:32:36.7527128Z 2025-05-07T20:32:36.7527225Z moe/activation_test.py:117: 2025-05-07T20:32:36.7527561Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:36.7527890Z moe/activation_test.py:115: in fn 2025-05-07T20:32:36.7528165Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:36.7528717Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/torch/_dynamo/eval_frame.py:678: in _fn 2025-05-07T20:32:36.7529274Z return fn(*args, **kwargs) 
2025-05-07T20:32:36.7529917Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:32:36.7530595Z _fbgemm_silu_mul_quant[grid]( 2025-05-07T20:32:36.7531169Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:36.7531842Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:36.7532496Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:36.7533024Z kernel = self.compile( 2025-05-07T20:32:36.7533556Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:36.7534283Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:36.7534670Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:36.7534902Z 2025-05-07T20:32:36.7535106Z self = 2025-05-07T20:32:36.7536219Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:36.7537604Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f779d0abb00>} 2025-05-07T20:32:36.7538916Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:36.7539921Z context = 2025-05-07T20:32:36.7540206Z 2025-05-07T20:32:36.7540367Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:36.7540881Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:36.7541336Z module_map=module_map) 2025-05-07T20:32:36.7541699Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:36.7542044Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:32:36.7542299Z E ^ 2025-05-07T20:32:36.7542750Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:36.7543192Z 2025-05-07T20:32:36.7543599Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:37.0313514Z 2025-05-07T20:32:37.0313803Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:37.0314251Z self=, 2025-05-07T20:32:37.0314725Z T=128, 2025-05-07T20:32:37.0315053Z D=7168, 2025-05-07T20:32:37.0315254Z scale_ub=1200.0, 2025-05-07T20:32:37.0315498Z contiguous=True, 2025-05-07T20:32:37.0315735Z compiled=False, 2025-05-07T20:32:37.0315947Z ) 2025-05-07T20:32:37.0316280Z self = 2025-05-07T20:32:37.0316793Z T = 128, D = 7168, scale_ub = 1200.0, contiguous = True, compiled = False 2025-05-07T20:32:37.0317065Z 2025-05-07T20:32:37.0317158Z @given( 2025-05-07T20:32:37.0317395Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:37.0317788Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:37.0318107Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:37.0318443Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:37.0318782Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:37.0319085Z ) 2025-05-07T20:32:37.0319436Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:37.0319886Z def test_silu_mul_quant( 2025-05-07T20:32:37.0320159Z self, 2025-05-07T20:32:37.0320367Z T: int, 2025-05-07T20:32:37.0320570Z D: int, 2025-05-07T20:32:37.0320795Z scale_ub: Optional[float], 2025-05-07T20:32:37.0321074Z contiguous: bool, 2025-05-07T20:32:37.0321437Z compiled: bool, 2025-05-07T20:32:37.0321678Z ) -> None: 2025-05-07T20:32:37.0321905Z torch.manual_seed(2025) 2025-05-07T20:32:37.0322150Z 2025-05-07T20:32:37.0322430Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:37.0322787Z 2025-05-07T20:32:37.0322992Z x_sign = torch.sign(x) 2025-05-07T20:32:37.0323286Z > x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:37.0325268Z E torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 20.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 4.44 MiB is free. Including non-PyTorch memory, this process has 22.06 GiB memory in use. Of the allocated memory 21.77 GiB is allocated by PyTorch, and 6.37 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. 
See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables) 2025-05-07T20:32:37.0327091Z 2025-05-07T20:32:37.0327276Z moe/activation_test.py:95: OutOfMemoryError 2025-05-07T20:32:37.0327492Z 2025-05-07T20:32:37.0327604Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:37.0328018Z self=, 2025-05-07T20:32:37.0328427Z T=128, 2025-05-07T20:32:37.0328623Z D=5120, 2025-05-07T20:32:37.0328826Z scale_ub=1200.0, 2025-05-07T20:32:37.0329052Z contiguous=True, 2025-05-07T20:32:37.0329288Z compiled=True, 2025-05-07T20:32:37.0329500Z ) 2025-05-07T20:32:37.0329820Z self = 2025-05-07T20:32:37.0330321Z T = 128, D = 5120, scale_ub = 1200.0, contiguous = True, compiled = True 2025-05-07T20:32:37.0330590Z 2025-05-07T20:32:37.0330678Z @given( 2025-05-07T20:32:37.0330914Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:37.0331241Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:37.0331559Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:37.0331892Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:37.0332228Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:37.0332525Z ) 2025-05-07T20:32:37.0332881Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:37.0333323Z def test_silu_mul_quant( 2025-05-07T20:32:37.0333576Z self, 2025-05-07T20:32:37.0333860Z T: int, 2025-05-07T20:32:37.0334059Z D: int, 2025-05-07T20:32:37.0334283Z scale_ub: Optional[float], 2025-05-07T20:32:37.0334565Z contiguous: bool, 2025-05-07T20:32:37.0334855Z compiled: bool, 2025-05-07T20:32:37.0335086Z ) -> None: 2025-05-07T20:32:37.0335312Z torch.manual_seed(2025) 2025-05-07T20:32:37.0335557Z 2025-05-07T20:32:37.0335835Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:37.0336183Z 2025-05-07T20:32:37.0336382Z x_sign = torch.sign(x) 2025-05-07T20:32:37.0336680Z > x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:37.0338630Z E torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 20.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 4.44 MiB is free. Including non-PyTorch memory, this process has 22.06 GiB memory in use. Of the allocated memory 21.77 GiB is allocated by PyTorch, and 3.87 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. 
See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables) 2025-05-07T20:32:37.0340646Z 2025-05-07T20:32:37.0340778Z moe/activation_test.py:95: OutOfMemoryError 2025-05-07T20:32:37.0340994Z 2025-05-07T20:32:37.0341110Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:37.0341576Z self=, 2025-05-07T20:32:37.0341996Z T=128, 2025-05-07T20:32:37.0342198Z D=7168, 2025-05-07T20:32:37.0342396Z scale_ub=None, 2025-05-07T20:32:37.0342615Z contiguous=True, 2025-05-07T20:32:37.0342852Z compiled=True, 2025-05-07T20:32:37.0343061Z ) 2025-05-07T20:32:37.0343493Z self = 2025-05-07T20:32:37.0344016Z T = 128, D = 7168, scale_ub = None, contiguous = True, compiled = True 2025-05-07T20:32:37.0344401Z 2025-05-07T20:32:37.0344547Z @given( 2025-05-07T20:32:37.0344866Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:37.0345306Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:37.0345741Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:37.0346189Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:37.0346644Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:37.0347038Z ) 2025-05-07T20:32:37.0347582Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:37.0348217Z def test_silu_mul_quant( 2025-05-07T20:32:37.0348559Z self, 2025-05-07T20:32:37.0348835Z T: int, 2025-05-07T20:32:37.0349115Z D: int, 2025-05-07T20:32:37.0349418Z scale_ub: Optional[float], 2025-05-07T20:32:37.0349819Z contiguous: bool, 2025-05-07T20:32:37.0350141Z compiled: bool, 2025-05-07T20:32:37.0350462Z ) -> None: 2025-05-07T20:32:37.0350758Z torch.manual_seed(2025) 2025-05-07T20:32:37.0351087Z 2025-05-07T20:32:37.0351461Z > x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:37.0366960Z E torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 20.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 4.44 MiB is free. Including non-PyTorch memory, this process has 22.06 GiB memory in use. Of the allocated memory 21.77 GiB is allocated by PyTorch, and 3.87 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. 
See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables) 2025-05-07T20:32:37.0369558Z 2025-05-07T20:32:37.0369730Z moe/activation_test.py:92: OutOfMemoryError 2025-05-07T20:32:37.0370026Z 2025-05-07T20:32:37.0370544Z FAILED 2025-05-07T20:32:37.0370693Z 2025-05-07T20:32:37.0370882Z =================================== FAILURES =================================== 2025-05-07T20:32:37.0371467Z _____________________ ActivationTests.test_silu_mul_quant ______________________ 2025-05-07T20:32:37.0372191Z + Exception Group Traceback (most recent call last): 2025-05-07T20:32:37.0373026Z | File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.13/unittest/case.py", line 58, in testPartExecutor 2025-05-07T20:32:37.0373884Z | yield 2025-05-07T20:32:37.0374483Z | File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.13/unittest/case.py", line 651, in run 2025-05-07T20:32:37.0375200Z | self._callTestMethod(testMethod) 2025-05-07T20:32:37.0375648Z | ~~~~~~~~~~~~~~~~~~~~^^^^^^^^^^^^ 2025-05-07T20:32:37.0376433Z | File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.13/unittest/case.py", line 606, in _callTestMethod 2025-05-07T20:32:37.0377194Z | if method() is not None: 2025-05-07T20:32:37.0377541Z | ~~~~~~^^ 2025-05-07T20:32:37.0378393Z | File "/home/ec2-user/actions-runner/_work/FBGEMM/FBGEMM/fbgemm_gpu/experimental/gen_ai/test/moe/activation_test.py", line 75, in test_silu_mul_quant 2025-05-07T20:32:37.0379376Z | T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:37.0379788Z | ^^^^^^^ 2025-05-07T20:32:37.0380536Z | File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/hypothesis/core.py", line 1850, in wrapped_test 2025-05-07T20:32:37.0381428Z | raise the_error_hypothesis_found 2025-05-07T20:32:37.0382053Z | ExceptionGroup: Hypothesis found 4 distinct failures. (4 sub-exceptions) 2025-05-07T20:32:37.0382622Z +-+---------------- 1 ---------------- 2025-05-07T20:32:37.0383013Z | Traceback (most recent call last): 2025-05-07T20:32:37.0383975Z | File "/home/ec2-user/actions-runner/_work/FBGEMM/FBGEMM/fbgemm_gpu/experimental/gen_ai/test/moe/activation_test.py", line 92, in test_silu_mul_quant 2025-05-07T20:32:37.0385049Z | x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:37.0389271Z | torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 40.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 4.44 MiB is free. Including non-PyTorch memory, this process has 22.06 GiB memory in use. Of the allocated memory 21.77 GiB is allocated by PyTorch, and 3.87 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. 
See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables) 2025-05-07T20:32:37.0391307Z | Falsifying example: test_silu_mul_quant( 2025-05-07T20:32:37.0391840Z | self=, 2025-05-07T20:32:37.0392270Z | T=2048, 2025-05-07T20:32:37.0392506Z | D=5120, # or any other generated value 2025-05-07T20:32:37.0392845Z | scale_ub=None, # or any other generated value 2025-05-07T20:32:37.0393204Z | contiguous=True, # or any other generated value 2025-05-07T20:32:37.0393571Z | compiled=False, # or any other generated value 2025-05-07T20:32:37.0393894Z | ) 2025-05-07T20:32:37.0394067Z | 2025-05-07T20:32:37.0394679Z | You can reproduce this example by temporarily adding @reproduce_failure('6.131.14', b'AEECQQBBAEEAQQE=') as a decorator on your test case 2025-05-07T20:32:37.0395290Z +---------------- 2 ---------------- 2025-05-07T20:32:37.0395584Z | Traceback (most recent call last): 2025-05-07T20:32:37.0396274Z | File "/home/ec2-user/actions-runner/_work/FBGEMM/FBGEMM/fbgemm_gpu/experimental/gen_ai/test/moe/activation_test.py", line 92, in test_silu_mul_quant 2025-05-07T20:32:37.0397038Z | x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:37.0399221Z | torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 20.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 4.44 MiB is free. Including non-PyTorch memory, this process has 22.06 GiB memory in use. Of the allocated memory 21.77 GiB is allocated by PyTorch, and 3.87 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables) 2025-05-07T20:32:37.0401241Z | Falsifying example: test_silu_mul_quant( 2025-05-07T20:32:37.0401674Z | self=, 2025-05-07T20:32:37.0402072Z | T=128, 2025-05-07T20:32:37.0402275Z | D=7168, 2025-05-07T20:32:37.0402548Z | scale_ub=None, 2025-05-07T20:32:37.0402779Z | contiguous=True, 2025-05-07T20:32:37.0403020Z | compiled=True, 2025-05-07T20:32:37.0403245Z | ) 2025-05-07T20:32:37.0403421Z | 2025-05-07T20:32:37.0403933Z | You can reproduce this example by temporarily adding @reproduce_failure('6.131.14', b'AEEBQQFBAEEAQQA=') as a decorator on your test case 2025-05-07T20:32:37.0404521Z +---------------- 3 ---------------- 2025-05-07T20:32:37.0404806Z | Traceback (most recent call last): 2025-05-07T20:32:37.0405550Z | File "/home/ec2-user/actions-runner/_work/FBGEMM/FBGEMM/fbgemm_gpu/experimental/gen_ai/test/moe/activation_test.py", line 92, in test_silu_mul_quant 2025-05-07T20:32:37.0406380Z | x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:37.0408358Z | torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 20.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 4.44 MiB is free. Including non-PyTorch memory, this process has 22.06 GiB memory in use. Of the allocated memory 21.77 GiB is allocated by PyTorch, and 3.87 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. 
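Each numbered sub-exception above ends with a replay hint of the form "temporarily adding @reproduce_failure('6.131.14', b'...')". A sketch of wiring the first blob into a copy of the test's decorator stack, with the strategies copied verbatim from the log; the blob only decodes against these exact, unchanged strategies and the same Hypothesis version (6.131.14 here), so in practice the decorator is added to the real test_silu_mul_quant rather than this stand-in:

from typing import Optional
from hypothesis import given, reproduce_failure, strategies as st

@reproduce_failure("6.131.14", b"AEECQQBBAEEAQQE=")  # blob from sub-exception 1
@given(
    T=st.sampled_from([1, 128, 2048, 4096, 16384]),
    D=st.sampled_from([5120, 7168]),
    scale_ub=st.sampled_from([None, 1200.00]),
    contiguous=st.sampled_from([True, False]),
    compiled=st.sampled_from([True, False]),
)
def test_silu_mul_quant_replay(
    T: int,
    D: int,
    scale_ub: Optional[float],
    contiguous: bool,
    compiled: bool,
) -> None:
    ...  # body as in the original test above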
See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables) 2025-05-07T20:32:37.0410264Z | Falsifying example: test_silu_mul_quant( 2025-05-07T20:32:37.0410697Z | self=, 2025-05-07T20:32:37.0411189Z | T=128, 2025-05-07T20:32:37.0411463Z | D=5120, 2025-05-07T20:32:37.0411741Z | scale_ub=1200.0, 2025-05-07T20:32:37.0412076Z | contiguous=True, 2025-05-07T20:32:37.0412407Z | compiled=True, 2025-05-07T20:32:37.0412784Z | ) 2025-05-07T20:32:37.0413038Z | 2025-05-07T20:32:37.0413859Z | You can reproduce this example by temporarily adding @reproduce_failure('6.131.14', b'AEEBQQBBAUEAQQA=') as a decorator on your test case 2025-05-07T20:32:37.0414686Z +---------------- 4 ---------------- 2025-05-07T20:32:37.0415080Z | Traceback (most recent call last): 2025-05-07T20:32:37.0416042Z | File "/home/ec2-user/actions-runner/_work/FBGEMM/FBGEMM/fbgemm_gpu/experimental/gen_ai/test/moe/activation_test.py", line 126, in test_silu_mul_quant 2025-05-07T20:32:37.0417000Z | y_fp8_ref, y_scale_ref = ref_fn() 2025-05-07T20:32:37.0417397Z | ~~~~~~^^ 2025-05-07T20:32:37.0418268Z | File "/home/ec2-user/actions-runner/_work/FBGEMM/FBGEMM/fbgemm_gpu/experimental/gen_ai/test/moe/activation_test.py", line 124, in ref_fn 2025-05-07T20:32:37.0419215Z | return triton_quantize_fp8_row(y, scale_ub_tensor) 2025-05-07T20:32:37.0420347Z | File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/fbgemm_gpu/experimental/gemm/triton_gemm/fp8_gemm.py", line 2370, in triton_quantize_fp8_row 2025-05-07T20:32:37.0421424Z | _kernel_quantize_fp8_row[grid]( 2025-05-07T20:32:37.0421818Z | ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~^ 2025-05-07T20:32:37.0422168Z | a, 2025-05-07T20:32:37.0422443Z | ^^ 2025-05-07T20:32:37.0422726Z | ...<23 lines>... 
2025-05-07T20:32:37.0423058Z | USE_INT64=use_int64, 2025-05-07T20:32:37.0423415Z | ^^^^^^^^^^^^^^^^^^^^ 2025-05-07T20:32:37.0423738Z | ) 2025-05-07T20:32:37.0424078Z | ^ 2025-05-07T20:32:37.0424783Z | File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/jit.py", line 330, in 2025-05-07T20:32:37.0425773Z | return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:37.0426397Z | ~~~~~~~~^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ 2025-05-07T20:32:37.0427273Z | File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/autotuner.py", line 186, in run 2025-05-07T20:32:37.0428381Z | timings = {config: self._bench(*args, config=config, **kwargs) for config in pruned_configs} 2025-05-07T20:32:37.0429014Z | ~~~~~~~~~~~^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ 2025-05-07T20:32:37.0429891Z | File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/autotuner.py", line 166, in _bench 2025-05-07T20:32:37.0430835Z | return self.do_bench(kernel_call, quantiles=(0.5, 0.2, 0.8)) 2025-05-07T20:32:37.0431354Z | ~~~~~~~~~~~~~^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ 2025-05-07T20:32:37.0432173Z | File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/testing.py", line 117, in do_bench 2025-05-07T20:32:37.0432999Z | fn() 2025-05-07T20:32:37.0433280Z | ~~^^ 2025-05-07T20:32:37.0434046Z | File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/autotuner.py", line 152, in kernel_call 2025-05-07T20:32:37.0434923Z | self.fn.run( 2025-05-07T20:32:37.0435231Z | ~~~~~~~~~~~^ 2025-05-07T20:32:37.0435529Z | *args, 2025-05-07T20:32:37.0435828Z | ^^^^^^ 2025-05-07T20:32:37.0436126Z | **current, 2025-05-07T20:32:37.0436435Z | ^^^^^^^^^^ 2025-05-07T20:32:37.0436738Z | ) 2025-05-07T20:32:37.0436999Z | ^ 2025-05-07T20:32:37.0437669Z | File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/jit.py", line 623, in run 2025-05-07T20:32:37.0438469Z | kernel = self.compile( 2025-05-07T20:32:37.0438821Z | src, 2025-05-07T20:32:37.0439120Z | target=target, 2025-05-07T20:32:37.0439474Z | options=options.__dict__, 2025-05-07T20:32:37.0439856Z | ) 2025-05-07T20:32:37.0440652Z | File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py", line 273, in compile 2025-05-07T20:32:37.0441616Z | module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:37.0442577Z | File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py", line 100, in make_ir 2025-05-07T20:32:37.0443663Z | return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:37.0444303Z | module_map=module_map) 2025-05-07T20:32:37.0444800Z | triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:37.0445287Z | def _kernel_quantize_fp8_row( 2025-05-07T20:32:37.0445637Z | ^ 2025-05-07T20:32:37.0446264Z | ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:37.0447039Z | Falsifying example: test_silu_mul_quant( 2025-05-07T20:32:37.0447592Z | # The test always failed when commented parts were varied together. 
2025-05-07T20:32:37.0448308Z |     self=,
2025-05-07T20:32:37.0448906Z |     T=1,  # or any other generated value
2025-05-07T20:32:37.0449337Z |     D=5120,  # or any other generated value
2025-05-07T20:32:37.0449794Z |     scale_ub=None,  # or any other generated value
2025-05-07T20:32:37.0450274Z |     contiguous=True,  # or any other generated value
2025-05-07T20:32:37.0450780Z |     compiled=True,  # or any other generated value
2025-05-07T20:32:37.0451259Z | )
2025-05-07T20:32:37.0451506Z |
2025-05-07T20:32:37.0452225Z | You can reproduce this example by temporarily adding @reproduce_failure('6.131.14', b'AEEAQQBBAEEAQQA=') as a decorator on your test case
2025-05-07T20:32:37.0453065Z +------------------------------------
2025-05-07T20:32:37.0453565Z ---------------------------------- Hypothesis ----------------------------------
2025-05-07T20:32:37.0454188Z Trying example: test_silu_mul_quant(
2025-05-07T20:32:37.0454755Z     self=,
2025-05-07T20:32:37.0455360Z     T=1,
2025-05-07T20:32:37.0455620Z     D=5120,
2025-05-07T20:32:37.0455893Z     scale_ub=None,
2025-05-07T20:32:37.0456172Z     contiguous=True,
2025-05-07T20:32:37.0456463Z     compiled=True,
2025-05-07T20:32:37.0456730Z )
2025-05-07T20:32:37.0457140Z self =
2025-05-07T20:32:37.0457768Z T = 1, D = 5120, scale_ub = None, contiguous = True, compiled = True
2025-05-07T20:32:37.0458103Z
2025-05-07T20:32:37.0458217Z     @given(
2025-05-07T20:32:37.0458515Z         T=st.sampled_from([1, 128, 2048, 4096, 16384]),
2025-05-07T20:32:37.0458941Z         D=st.sampled_from([5120, 7168]),
2025-05-07T20:32:37.0459409Z         scale_ub=st.sampled_from([None, 1200.00]),
2025-05-07T20:32:37.0459855Z         contiguous=st.sampled_from([True, False]),
2025-05-07T20:32:37.0460290Z         compiled=st.sampled_from([True, False]),
2025-05-07T20:32:37.0460690Z     )
2025-05-07T20:32:37.0461175Z     @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None)
2025-05-07T20:32:37.0461751Z     def test_silu_mul_quant(
2025-05-07T20:32:37.0462071Z         self,
2025-05-07T20:32:37.0462342Z         T: int,
2025-05-07T20:32:37.0462607Z         D: int,
2025-05-07T20:32:37.0462908Z         scale_ub: Optional[float],
2025-05-07T20:32:37.0463290Z         contiguous: bool,
2025-05-07T20:32:37.0463616Z         compiled: bool,
2025-05-07T20:32:37.0463914Z     ) -> None:
2025-05-07T20:32:37.0464211Z         torch.manual_seed(2025)
2025-05-07T20:32:37.0464529Z
2025-05-07T20:32:37.0464908Z         x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16)
2025-05-07T20:32:37.0465380Z
2025-05-07T20:32:37.0465650Z         x_sign = torch.sign(x)
2025-05-07T20:32:37.0466141Z         x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0)
2025-05-07T20:32:37.0466572Z         x = x_sign * x_clamp
2025-05-07T20:32:37.0466896Z         x0 = x[:, :D]
2025-05-07T20:32:37.0467183Z         x1 = x[:, D:]
2025-05-07T20:32:37.0467472Z
2025-05-07T20:32:37.0467733Z         if contiguous:
2025-05-07T20:32:37.0468047Z             x0 = x0.contiguous()
2025-05-07T20:32:37.0468409Z             x1 = x1.contiguous()
2025-05-07T20:32:37.0468724Z
2025-05-07T20:32:37.0468972Z         if scale_ub is not None:
2025-05-07T20:32:37.0469338Z             scale_ub_tensor = torch.tensor(
2025-05-07T20:32:37.0469776Z                 [scale_ub], device="cuda", dtype=torch.float32
2025-05-07T20:32:37.0470184Z             )
2025-05-07T20:32:37.0470447Z         else:
2025-05-07T20:32:37.0470741Z             scale_ub_tensor = None
2025-05-07T20:32:37.0471079Z
2025-05-07T20:32:37.0471402Z         def fn() -> Tuple[torch.Tensor, torch.Tensor]:
2025-05-07T20:32:37.0471844Z             op = silu_mul_quant
2025-05-07T20:32:37.0472189Z             if compiled:
2025-05-07T20:32:37.0472541Z                 op = torch.compile(op)
2025-05-07T20:32:37.0472952Z             return op(x0, x1, scale_ub_tensor)
2025-05-07T20:32:37.0473337Z
2025-05-07T20:32:37.0473599Z         y_fp8, y_scale = fn()
2025-05-07T20:32:37.0473994Z         y = y_fp8.to(torch.float32) * y_scale[:, None]
2025-05-07T20:32:37.0474412Z
2025-05-07T20:32:37.0474732Z         def ref_fn() -> Tuple[torch.Tensor, torch.Tensor]:
2025-05-07T20:32:37.0475197Z             x0_fp32 = x0.to(torch.float32)
2025-05-07T20:32:37.0475604Z             x1_fp32 = x1.to(torch.float32)
2025-05-07T20:32:37.0476076Z             y = x0_fp32 * torch.sigmoid(x0_fp32) * x1_fp32
2025-05-07T20:32:37.0476558Z             return triton_quantize_fp8_row(y, scale_ub_tensor)
2025-05-07T20:32:37.0476967Z
2025-05-07T20:32:37.0477226Z >       y_fp8_ref, y_scale_ref = ref_fn()
2025-05-07T20:32:37.0477495Z
2025-05-07T20:32:37.0477632Z moe/activation_test.py:126:
2025-05-07T20:32:37.0478040Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _
2025-05-07T20:32:37.0478488Z moe/activation_test.py:124: in ref_fn
2025-05-07T20:32:37.0478971Z     return triton_quantize_fp8_row(y, scale_ub_tensor)
2025-05-07T20:32:37.0480041Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/fbgemm_gpu/experimental/gemm/triton_gemm/fp8_gemm.py:2370: in triton_quantize_fp8_row
2025-05-07T20:32:37.0481081Z     _kernel_quantize_fp8_row[grid](
2025-05-07T20:32:37.0481826Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/jit.py:330: in <lambda>
2025-05-07T20:32:37.0482781Z     return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs)
2025-05-07T20:32:37.0483706Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/autotuner.py:186: in run
2025-05-07T20:32:37.0484772Z     timings = {config: self._bench(*args, config=config, **kwargs) for config in pruned_configs}
2025-05-07T20:32:37.0485737Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/autotuner.py:166: in _bench
2025-05-07T20:32:37.0486578Z     return self.do_bench(kernel_call, quantiles=(0.5, 0.2, 0.8))
2025-05-07T20:32:37.0487376Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/testing.py:117: in do_bench
2025-05-07T20:32:37.0488055Z     fn()
2025-05-07T20:32:37.0488722Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/autotuner.py:152: in kernel_call
2025-05-07T20:32:37.0489488Z     self.fn.run(
2025-05-07T20:32:37.0490111Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/jit.py:623: in run
2025-05-07T20:32:37.0490808Z     kernel = self.compile(
2025-05-07T20:32:37.0491513Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:273: in compile
2025-05-07T20:32:37.0492384Z     module = src.make_ir(options, codegen_fns, module_map, context)
2025-05-07T20:32:37.0492976Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _
2025-05-07T20:32:37.0493287Z
2025-05-07T20:32:37.0493555Z self =
2025-05-07T20:32:37.0495124Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=2, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True)
2025-05-07T20:32:37.0497006Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f78c52aa700>}
2025-05-07T20:32:37.0499057Z module_map = {'triton.language.extra.libdevice': }
2025-05-07T20:32:37.0500433Z context =
2025-05-07T20:32:37.0500826Z
2025-05-07T20:32:37.0501041Z     def make_ir(self, options, codegen_fns, module_map, context):
2025-05-07T20:32:37.0501770Z >       return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns,
2025-05-07T20:32:37.0502396Z                            module_map=module_map)
2025-05-07T20:32:37.0502889Z E       triton.compiler.errors.CompilationError: at 1:0:
2025-05-07T20:32:37.0503376Z E       def _kernel_quantize_fp8_row(
2025-05-07T20:32:37.0503753Z E       ^
2025-05-07T20:32:37.0504373Z E       ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')")
2025-05-07T20:32:37.0505133Z
2025-05-07T20:32:37.0505748Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:100: CompilationError
2025-05-07T20:32:37.0506450Z
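[Note: the reference path that fails above, triton_quantize_fp8_row, performs row-wise FP8 quantization. The sketch below is a minimal eager-mode illustration of the assumed semantics (per-row scale = max|x| / FP8_MAX, optionally capped by scale_ub, dequantized as y_fp8.to(float32) * y_scale[:, None], matching the test's check). The helper name quantize_fp8_row_eager and the exact zero-row/clamping details are illustrative, not FBGEMM's implementation.]

from typing import Optional, Tuple
import torch

FP8_MAX = torch.finfo(torch.float8_e4m3fn).max  # 448.0 for e4m3fn

def quantize_fp8_row_eager(
    x: torch.Tensor, scale_ub: Optional[torch.Tensor] = None
) -> Tuple[torch.Tensor, torch.Tensor]:
    # Per-row absolute max, computed in fp32 for a stable scale.
    row_max = x.abs().amax(dim=-1).to(torch.float32)
    if scale_ub is not None:
        # Cap the per-row dynamic range before deriving the scale.
        row_max = torch.clamp(row_max, max=scale_ub.item())
    # Guard all-zero rows, then map each row onto the FP8 representable range.
    scale = torch.clamp(row_max, min=1e-12) / FP8_MAX
    x_fp8 = (x.to(torch.float32) / scale[:, None]).to(torch.float8_e4m3fn)
    return x_fp8, scale  # dequantize: x_fp8.to(torch.float32) * scale[:, None]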
2025-05-07T20:32:37.0506598Z Trying example: test_silu_mul_quant(
2025-05-07T20:32:37.0507160Z     self=,
2025-05-07T20:32:37.0507694Z     T=2048,
2025-05-07T20:32:37.0507955Z     D=5120,
2025-05-07T20:32:37.0508311Z     scale_ub=1200.0,
2025-05-07T20:32:37.0508603Z     contiguous=True,
2025-05-07T20:32:37.0508916Z     compiled=False,
2025-05-07T20:32:37.0509202Z )
2025-05-07T20:32:37.0509623Z self =
2025-05-07T20:32:37.0510287Z T = 2048, D = 5120, scale_ub = 1200.0, contiguous = True, compiled = False
[... test body identical to the first "Trying example" above; with compiled=False the failure comes from fn() itself rather than the reference path ...]
2025-05-07T20:32:37.0526209Z >       y_fp8, y_scale = fn()
2025-05-07T20:32:37.0526565Z moe/activation_test.py:117:
2025-05-07T20:32:37.0526972Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _
2025-05-07T20:32:37.0527431Z moe/activation_test.py:115: in fn
2025-05-07T20:32:37.0527816Z     return op(x0, x1, scale_ub_tensor)
2025-05-07T20:32:37.0528757Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant
2025-05-07T20:32:37.0529750Z     _fbgemm_silu_mul_quant[grid](
2025-05-07T20:32:37.0530455Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/jit.py:330: in <lambda>
2025-05-07T20:32:37.0531363Z     return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs)
2025-05-07T20:32:37.0532283Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/jit.py:623: in run
2025-05-07T20:32:37.0533027Z     kernel = self.compile(
2025-05-07T20:32:37.0533915Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:273: in compile
2025-05-07T20:32:37.0534950Z     module = src.make_ir(options, codegen_fns, module_map, context)
2025-05-07T20:32:37.0535524Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _
2025-05-07T20:32:37.0535847Z
2025-05-07T20:32:37.0536132Z self =
2025-05-07T20:32:37.0537607Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True)
2025-05-07T20:32:37.0539578Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f78c5162020>}
2025-05-07T20:32:37.0541334Z module_map = {'triton.language.extra.libdevice': }
2025-05-07T20:32:37.0542689Z context =
2025-05-07T20:32:37.0543067Z
2025-05-07T20:32:37.0543275Z     def make_ir(self, options, codegen_fns, module_map, context):
2025-05-07T20:32:37.0543935Z >       return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns,
2025-05-07T20:32:37.0544574Z                            module_map=module_map)
2025-05-07T20:32:37.0545060Z E       triton.compiler.errors.CompilationError: at 1:0:
2025-05-07T20:32:37.0545565Z E       def _fbgemm_silu_mul_quant(
2025-05-07T20:32:37.0545915Z E       ^
2025-05-07T20:32:37.0546588Z E       ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')")
2025-05-07T20:32:37.0547178Z
2025-05-07T20:32:37.0547722Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:100: CompilationError
2025-05-07T20:32:37.0548411Z
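[Note: Triton's fp8e4nv is the e4m3 format behind torch.float8_e4m3fn, and this ValueError is what Triton raises when asked to emit it on a GPU older than compute capability 8.9 (Ada/Hopper); both the eager path (_fbgemm_silu_mul_quant) and the reference path (_kernel_quantize_fp8_row) hit it on this runner. A hedged sketch of a guard such a test could use; the helper name _supports_fp8e4nv is illustrative, not part of the test file.]

import unittest
import torch

def _supports_fp8e4nv() -> bool:
    # fp8e4nv (e4m3) codegen is assumed to require SM 8.9 or newer.
    if not torch.cuda.is_available():
        return False
    return torch.cuda.get_device_capability() >= (8, 9)

# Usage sketch: skip the property test on older GPUs instead of failing it.
# @unittest.skipIf(not _supports_fp8e4nv(), "FP8 e4m3 requires SM 8.9+")
# def test_silu_mul_quant(self, ...) -> None: ...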
2025-05-07T20:32:37.0548553Z Trying example: test_silu_mul_quant(self=, T=2048, D=5120, scale_ub=1200.0, contiguous=True, compiled=True)
[... test body identical to the first "Trying example" above; same CompilationError from the reference path in _kernel_quantize_fp8_row: ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") ...]
2025-05-07T20:32:37.0605277Z Trying example: test_silu_mul_quant(self=, T=16384, D=7168, scale_ub=1200.0, contiguous=False, compiled=False)
[... same CompilationError from fn() in _fbgemm_silu_mul_quant ...]
2025-05-07T20:32:37.0635620Z Trying example: test_silu_mul_quant(self=, T=1, D=7168, scale_ub=None, contiguous=True, compiled=True)
[... same CompilationError from the reference path in _kernel_quantize_fp8_row ...]
2025-05-07T20:32:37.0673361Z Trying example: test_silu_mul_quant(self=, T=4096, D=5120, scale_ub=None, contiguous=False, compiled=False)
[... same CompilationError from fn() in _fbgemm_silu_mul_quant ...]
2025-05-07T20:32:37.0689780Z Trying example: test_silu_mul_quant(self=, T=4096, D=7168, scale_ub=None, contiguous=False, compiled=False)
[... same CompilationError from fn() in _fbgemm_silu_mul_quant ...]
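[Note: every example Hypothesis tries above is drawn from the same sampled_from grid, 5 x 2 x 2 x 2 x 2 = 80 combinations, and each one fails identically. For comparison, an exhaustive sweep over the same space could be sketched with pytest.mark.parametrize; PARAM_GRID and test_silu_mul_quant_grid are illustrative names, not code from the repository.]

import itertools
import pytest

# The same parameter space as the @given(...) strategies in the test above.
PARAM_GRID = list(itertools.product(
    [1, 128, 2048, 4096, 16384],  # T
    [5120, 7168],                 # D
    [None, 1200.0],               # scale_ub
    [True, False],                # contiguous
    [True, False],                # compiled
))

@pytest.mark.parametrize("T,D,scale_ub,contiguous,compiled", PARAM_GRID)
def test_silu_mul_quant_grid(T, D, scale_ub, contiguous, compiled) -> None:
    ...  # same body as test_silu_mul_quant above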
2025-05-07T20:32:37.0702753Z Trying example: test_silu_mul_quant(self=, T=128, D=7168, scale_ub=None, contiguous=False, compiled=True)
[... same CompilationError from the reference path in _kernel_quantize_fp8_row ...]
2025-05-07T20:32:37.0726293Z Trying example: test_silu_mul_quant(self=, T=128, D=7168, scale_ub=None, contiguous=False, compiled=False)
[... same CompilationError from fn() in _fbgemm_silu_mul_quant ...]
2025-05-07T20:32:37.0739191Z Trying example: test_silu_mul_quant(self=, T=4096, D=5120, scale_ub=1200.0, contiguous=True, compiled=False)
[... same CompilationError from fn() in _fbgemm_silu_mul_quant ...]
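[Note: as the log's own hint says, any single failure here can be replayed deterministically by pinning Hypothesis to the recorded blob. A sketch for the first falsifying example, with the decorator arguments copied verbatim from the log; the decorator should be removed again after debugging.]

from hypothesis import reproduce_failure

@reproduce_failure('6.131.14', b'AEEBQQBBAUEAQQA=')
def test_silu_mul_quant(self, T, D, scale_ub, contiguous, compiled) -> None:
    ...  # unchanged test body; Hypothesis replays exactly the recorded example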
_fbgemm_silu_mul_quant[grid]( 2025-05-07T20:32:37.0746188Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:37.0746409Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:37.0746753Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:37.0746846Z kernel = self.compile( 2025-05-07T20:32:37.0747229Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:37.0747402Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:37.0747532Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:37.0747578Z 2025-05-07T20:32:37.0747787Z self = 2025-05-07T20:32:37.0748553Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:37.0749055Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f78bf90a5c0>} 2025-05-07T20:32:37.0749825Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:37.0750013Z context = 2025-05-07T20:32:37.0750020Z 2025-05-07T20:32:37.0750187Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:37.0750443Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:37.0750595Z module_map=module_map) 2025-05-07T20:32:37.0750761Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:37.0750860Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:32:37.0750943Z E ^ 2025-05-07T20:32:37.0751291Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:37.0751299Z 2025-05-07T20:32:37.0751705Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:37.0751716Z 2025-05-07T20:32:37.0751819Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:37.0752039Z self=, 2025-05-07T20:32:37.0752126Z T=1, 2025-05-07T20:32:37.0752202Z D=5120, 2025-05-07T20:32:37.0752287Z scale_ub=None, 2025-05-07T20:32:37.0752382Z contiguous=True, 2025-05-07T20:32:37.0752465Z compiled=True, 2025-05-07T20:32:37.0752540Z ) 2025-05-07T20:32:37.0752770Z self = 2025-05-07T20:32:37.0752971Z T = 1, D = 5120, scale_ub = None, contiguous = True, compiled = True 2025-05-07T20:32:37.0752976Z 2025-05-07T20:32:37.0753060Z @given( 2025-05-07T20:32:37.0753184Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:37.0753284Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:37.0753403Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:37.0753520Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:37.0753634Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:37.0753714Z ) 2025-05-07T20:32:37.0753958Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:37.0754055Z def test_silu_mul_quant( 2025-05-07T20:32:37.0754138Z self, 2025-05-07T20:32:37.0754218Z T: int, 2025-05-07T20:32:37.0754296Z D: int, 2025-05-07T20:32:37.0754404Z scale_ub: Optional[float], 2025-05-07T20:32:37.0754494Z contiguous: bool, 2025-05-07T20:32:37.0754590Z compiled: bool, 2025-05-07T20:32:37.0754668Z ) -> None: 2025-05-07T20:32:37.0754762Z torch.manual_seed(2025) 2025-05-07T20:32:37.0754844Z 2025-05-07T20:32:37.0755009Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:37.0755084Z 2025-05-07T20:32:37.0755185Z x_sign = torch.sign(x) 2025-05-07T20:32:37.0755308Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:37.0755410Z x = x_sign * x_clamp 2025-05-07T20:32:37.0755511Z x0 = x[:, :D] 2025-05-07T20:32:37.0755604Z x1 = x[:, D:] 2025-05-07T20:32:37.0755733Z 2025-05-07T20:32:37.0755823Z if contiguous: 2025-05-07T20:32:37.0755917Z x0 = x0.contiguous() 2025-05-07T20:32:37.0756013Z x1 = x1.contiguous() 2025-05-07T20:32:37.0756087Z 2025-05-07T20:32:37.0756179Z if scale_ub is not None: 2025-05-07T20:32:37.0756293Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:37.0756430Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:37.0756507Z ) 2025-05-07T20:32:37.0756590Z else: 2025-05-07T20:32:37.0756727Z scale_ub_tensor = None 2025-05-07T20:32:37.0756804Z 2025-05-07T20:32:37.0756938Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:37.0757031Z op = silu_mul_quant 2025-05-07T20:32:37.0757116Z if compiled: 2025-05-07T20:32:37.0757221Z op = torch.compile(op) 2025-05-07T20:32:37.0757326Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:37.0757405Z 2025-05-07T20:32:37.0757499Z y_fp8, y_scale = fn() 2025-05-07T20:32:37.0757619Z y = y_fp8.to(torch.float32) * y_scale[:, None] 2025-05-07T20:32:37.0757699Z 2025-05-07T20:32:37.0757833Z def ref_fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:37.0758047Z x0_fp32 = x0.to(torch.float32) 2025-05-07T20:32:37.0758156Z x1_fp32 = x1.to(torch.float32) 2025-05-07T20:32:37.0758279Z y = x0_fp32 * torch.sigmoid(x0_fp32) * x1_fp32 2025-05-07T20:32:37.0758417Z return triton_quantize_fp8_row(y, scale_ub_tensor) 2025-05-07T20:32:37.0758499Z 2025-05-07T20:32:37.0758599Z > y_fp8_ref, 
y_scale_ref = ref_fn() 2025-05-07T20:32:37.0758604Z 2025-05-07T20:32:37.0758706Z moe/activation_test.py:126: 2025-05-07T20:32:37.0758834Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:37.0758938Z moe/activation_test.py:124: in ref_fn 2025-05-07T20:32:37.0759076Z return triton_quantize_fp8_row(y, scale_ub_tensor) 2025-05-07T20:32:37.0759625Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/fbgemm_gpu/experimental/gemm/triton_gemm/fp8_gemm.py:2370: in triton_quantize_fp8_row 2025-05-07T20:32:37.0759725Z _kernel_quantize_fp8_row[grid]( 2025-05-07T20:32:37.0760130Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:37.0760351Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:37.0760722Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/autotuner.py:186: in run 2025-05-07T20:32:37.0760972Z timings = {config: self._bench(*args, config=config, **kwargs) for config in pruned_configs} 2025-05-07T20:32:37.0761343Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/autotuner.py:166: in _bench 2025-05-07T20:32:37.0761513Z return self.do_bench(kernel_call, quantiles=(0.5, 0.2, 0.8)) 2025-05-07T20:32:37.0761850Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/testing.py:117: in do_bench 2025-05-07T20:32:37.0761926Z fn() 2025-05-07T20:32:37.0762332Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/autotuner.py:152: in kernel_call 2025-05-07T20:32:37.0762415Z self.fn.run( 2025-05-07T20:32:37.0762756Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:37.0762848Z kernel = self.compile( 2025-05-07T20:32:37.0763226Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:37.0763402Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:37.0763529Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:37.0763533Z 2025-05-07T20:32:37.0763743Z self = 2025-05-07T20:32:37.0764553Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=2, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:37.0765055Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f78bf90b240>} 2025-05-07T20:32:37.0765844Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:37.0766071Z context = 2025-05-07T20:32:37.0766076Z 2025-05-07T20:32:37.0766242Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:37.0766498Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:37.0766607Z module_map=module_map) 2025-05-07T20:32:37.0766773Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:37.0766876Z E def _kernel_quantize_fp8_row( 2025-05-07T20:32:37.0766997Z E ^ 2025-05-07T20:32:37.0767350Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')")

/home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:100: CompilationError

Trying example: test_silu_mul_quant(self=, T=2048, D=5120, scale_ub=None, contiguous=True, compiled=True)
Trying example: test_silu_mul_quant(self=, T=128, D=5120, scale_ub=None, contiguous=True, compiled=True)
Trying example: test_silu_mul_quant(self=, T=4096, D=5120, scale_ub=None, contiguous=True, compiled=True)
Trying example: test_silu_mul_quant(self=, T=16384, D=5120, scale_ub=None, contiguous=True, compiled=True)

Each of these examples reprints the identical test body and fails at the same point: the reference path (ref_fn at moe/activation_test.py:126 -> triton_quantize_fp8_row -> _kernel_quantize_fp8_row[grid]) raises

E       triton.compiler.errors.CompilationError: at 1:0:
E       def _kernel_quantize_fp8_row(
E       ^
E       ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')")

/home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:100: CompilationError
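Root cause of every failure above: Triton's fp8e4nv is float8_e4m3fn, which Triton only lowers on NVIDIA GPUs with compute capability 8.9 or newer (Ada and Hopper parts such as L4 and H100). This job runs on a g5.4xlarge runner, whose A10G is compute capability 8.6, where Triton exposes only fp8e4b15 and fp8e5; both _fbgemm_silu_mul_quant and _kernel_quantize_fp8_row therefore fail at kernel-compile time, before any numerics run. A minimal sketch of a capability gate that would let such examples skip cleanly on pre-SM-8.9 runners (the helper and decorator names below are illustrative, not from activation_test.py):

    import unittest

    import torch

    def supports_fp8e4nv() -> bool:
        # Triton lowers fp8e4nv (torch.float8_e4m3fn) only on SM 8.9+;
        # the A10G on g5 instances is SM 8.6, so this returns False there.
        if not torch.cuda.is_available():
            return False
        return torch.cuda.get_device_capability() >= (8, 9)

    # Hypothetical decorator a test like test_silu_mul_quant could carry:
    skip_if_no_fp8e4nv = unittest.skipUnless(
        supports_fp8e4nv(), "Triton fp8e4nv requires compute capability >= 8.9"
    )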
Trying example: test_silu_mul_quant(self=, T=1, D=5120, scale_ub=1200.0, contiguous=True, compiled=True)

(test body identical to the examples above; this example fails on the compiled fn() path, so the traceback additionally passes through torch._dynamo)

>       y_fp8, y_scale = fn()

moe/activation_test.py:117:
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _
moe/activation_test.py:115: in fn
    return op(x0, x1, scale_ub_tensor)
/home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/torch/_dynamo/eval_frame.py:678: in _fn
    return fn(*args, **kwargs)
/home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant
    _fbgemm_silu_mul_quant[grid](
/home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/jit.py:623: in run
    kernel = self.compile(
/home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:273: in compile
    module = src.make_ir(options, codegen_fns, module_map, context)

E       triton.compiler.errors.CompilationError: at 1:0:
E       def _fbgemm_silu_mul_quant(
E       ^
E       ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')")

/home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:100: CompilationError
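For reference, the computation under test can be stated without Triton. The SiLU-mul part is taken directly from the test's ref_fn(); the row-wise fp8 scaling below is a sketch standing in for fbgemm's triton_quantize_fp8_row (the scale_ub clamp and the eps floor are assumptions; torch.finfo(torch.float8_e4m3fn).max is 448.0):

    from typing import Optional, Tuple

    import torch

    def silu_mul_quant_ref(
        x0: torch.Tensor,
        x1: torch.Tensor,
        scale_ub: Optional[torch.Tensor] = None,
    ) -> Tuple[torch.Tensor, torch.Tensor]:
        # SiLU(x0) * x1 in fp32, exactly as in the test's ref_fn().
        y = x0.float() * torch.sigmoid(x0.float()) * x1.float()
        # Per-row scale so each row fits into float8_e4m3fn's range;
        # scale_ub, when given, caps the row maximum before scaling.
        row_max = y.abs().amax(dim=1)
        if scale_ub is not None:
            row_max = torch.minimum(row_max, scale_ub)
        scale = row_max.clamp(min=1e-12) / torch.finfo(torch.float8_e4m3fn).max
        y_fp8 = (y / scale[:, None]).to(torch.float8_e4m3fn)
        # The test reconstructs y as y_fp8.to(torch.float32) * scale[:, None].
        return y_fp8, scale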
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:37.0855542Z 2025-05-07T20:32:37.0855998Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:37.0856006Z 2025-05-07T20:32:37.0856115Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:37.0856345Z self=, 2025-05-07T20:32:37.0856427Z T=1, 2025-05-07T20:32:37.0856505Z D=5120, 2025-05-07T20:32:37.0856597Z scale_ub=None, 2025-05-07T20:32:37.0856685Z contiguous=False, 2025-05-07T20:32:37.0856769Z compiled=True, 2025-05-07T20:32:37.0856930Z ) 2025-05-07T20:32:37.0857148Z self = 2025-05-07T20:32:37.0857318Z T = 1, D = 5120, scale_ub = None, contiguous = False, compiled = True 2025-05-07T20:32:37.0857323Z 2025-05-07T20:32:37.0857399Z @given( 2025-05-07T20:32:37.0857523Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:37.0857632Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:37.0857752Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:37.0857914Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:37.0858033Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:37.0858109Z ) 2025-05-07T20:32:37.0858355Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:37.0858450Z def test_silu_mul_quant( 2025-05-07T20:32:37.0858528Z self, 2025-05-07T20:32:37.0858610Z T: int, 2025-05-07T20:32:37.0858686Z D: int, 2025-05-07T20:32:37.0858789Z scale_ub: Optional[float], 2025-05-07T20:32:37.0858884Z contiguous: bool, 2025-05-07T20:32:37.0858969Z compiled: bool, 2025-05-07T20:32:37.0859049Z ) -> None: 2025-05-07T20:32:37.0859150Z torch.manual_seed(2025) 2025-05-07T20:32:37.0859265Z 2025-05-07T20:32:37.0859438Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:37.0859518Z 2025-05-07T20:32:37.0859610Z x_sign = torch.sign(x) 2025-05-07T20:32:37.0859736Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:37.0859833Z x = x_sign * x_clamp 2025-05-07T20:32:37.0859915Z x0 = x[:, :D] 2025-05-07T20:32:37.0859996Z x1 = x[:, D:] 2025-05-07T20:32:37.0860070Z 2025-05-07T20:32:37.0860154Z if contiguous: 2025-05-07T20:32:37.0860251Z x0 = x0.contiguous() 2025-05-07T20:32:37.0860340Z x1 = x1.contiguous() 2025-05-07T20:32:37.0860411Z 2025-05-07T20:32:37.0860509Z if scale_ub is not None: 2025-05-07T20:32:37.0860615Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:37.0860749Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:37.0860833Z ) 2025-05-07T20:32:37.0860910Z else: 2025-05-07T20:32:37.0861005Z scale_ub_tensor = None 2025-05-07T20:32:37.0861082Z 2025-05-07T20:32:37.0861282Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:37.0861374Z op = silu_mul_quant 2025-05-07T20:32:37.0861468Z if compiled: 2025-05-07T20:32:37.0861570Z op = torch.compile(op) 2025-05-07T20:32:37.0861681Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:37.0861753Z 2025-05-07T20:32:37.0861843Z y_fp8, y_scale = fn() 2025-05-07T20:32:37.0861967Z y = y_fp8.to(torch.float32) * y_scale[:, None] 2025-05-07T20:32:37.0862040Z 2025-05-07T20:32:37.0862176Z def ref_fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:37.0862287Z x0_fp32 = x0.to(torch.float32) 2025-05-07T20:32:37.0862388Z x1_fp32 = x1.to(torch.float32) 2025-05-07T20:32:37.0862510Z y = x0_fp32 * torch.sigmoid(x0_fp32) * x1_fp32 2025-05-07T20:32:37.0862659Z return triton_quantize_fp8_row(y, scale_ub_tensor) 2025-05-07T20:32:37.0862731Z 2025-05-07T20:32:37.0862838Z > y_fp8_ref, 
y_scale_ref = ref_fn() 2025-05-07T20:32:37.0862843Z 2025-05-07T20:32:37.0862942Z moe/activation_test.py:126: 2025-05-07T20:32:37.0863074Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:37.0863190Z moe/activation_test.py:124: in ref_fn 2025-05-07T20:32:37.0863327Z return triton_quantize_fp8_row(y, scale_ub_tensor) 2025-05-07T20:32:37.0863876Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/fbgemm_gpu/experimental/gemm/triton_gemm/fp8_gemm.py:2370: in triton_quantize_fp8_row 2025-05-07T20:32:37.0863983Z _kernel_quantize_fp8_row[grid]( 2025-05-07T20:32:37.0864387Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:37.0864621Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:37.0864991Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/autotuner.py:186: in run 2025-05-07T20:32:37.0865248Z timings = {config: self._bench(*args, config=config, **kwargs) for config in pruned_configs} 2025-05-07T20:32:37.0865662Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/autotuner.py:166: in _bench 2025-05-07T20:32:37.0865834Z return self.do_bench(kernel_call, quantiles=(0.5, 0.2, 0.8)) 2025-05-07T20:32:37.0866181Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/testing.py:117: in do_bench 2025-05-07T20:32:37.0866260Z fn() 2025-05-07T20:32:37.0866653Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/autotuner.py:152: in kernel_call 2025-05-07T20:32:37.0866744Z self.fn.run( 2025-05-07T20:32:37.0867080Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:37.0867173Z kernel = self.compile( 2025-05-07T20:32:37.0867598Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:37.0867774Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:37.0867917Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:37.0867921Z 2025-05-07T20:32:37.0868128Z self = 2025-05-07T20:32:37.0868897Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=2, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:37.0869404Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f779dd02de0>} 2025-05-07T20:32:37.0870178Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:37.0870376Z context = 2025-05-07T20:32:37.0870382Z 2025-05-07T20:32:37.0870546Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:37.0870809Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:37.0870919Z module_map=module_map) 2025-05-07T20:32:37.0871081Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:37.0871194Z E def _kernel_quantize_fp8_row( 2025-05-07T20:32:37.0871270Z E ^ 2025-05-07T20:32:37.0871625Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:37.0871630Z 2025-05-07T20:32:37.0872039Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:37.0872044Z 2025-05-07T20:32:37.0872146Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:37.0872374Z self=, 2025-05-07T20:32:37.0872451Z T=1, 2025-05-07T20:32:37.0872529Z D=5120, 2025-05-07T20:32:37.0872618Z scale_ub=None, 2025-05-07T20:32:37.0872702Z contiguous=True, 2025-05-07T20:32:37.0872790Z compiled=False, 2025-05-07T20:32:37.0872864Z ) 2025-05-07T20:32:37.0873080Z self = 2025-05-07T20:32:37.0873293Z T = 1, D = 5120, scale_ub = None, contiguous = True, compiled = False 2025-05-07T20:32:37.0873297Z 2025-05-07T20:32:37.0873373Z @given( 2025-05-07T20:32:37.0873493Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:37.0873596Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:37.0873714Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:37.0873833Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:37.0873949Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:37.0874068Z ) 2025-05-07T20:32:37.0874316Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:37.0874411Z def test_silu_mul_quant( 2025-05-07T20:32:37.0874486Z self, 2025-05-07T20:32:37.0874566Z T: int, 2025-05-07T20:32:37.0874642Z D: int, 2025-05-07T20:32:37.0874742Z scale_ub: Optional[float], 2025-05-07T20:32:37.0874834Z contiguous: bool, 2025-05-07T20:32:37.0874919Z compiled: bool, 2025-05-07T20:32:37.0875000Z ) -> None: 2025-05-07T20:32:37.0875098Z torch.manual_seed(2025) 2025-05-07T20:32:37.0875169Z 2025-05-07T20:32:37.0875334Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:37.0875409Z 2025-05-07T20:32:37.0875546Z x_sign = torch.sign(x) 2025-05-07T20:32:37.0875678Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:37.0875766Z x = x_sign * x_clamp 2025-05-07T20:32:37.0875846Z x0 = x[:, :D] 2025-05-07T20:32:37.0875940Z x1 = x[:, D:] 2025-05-07T20:32:37.0876011Z 2025-05-07T20:32:37.0876094Z if contiguous: 2025-05-07T20:32:37.0876190Z x0 = x0.contiguous() 2025-05-07T20:32:37.0876278Z x1 = x1.contiguous() 2025-05-07T20:32:37.0876350Z 2025-05-07T20:32:37.0876446Z if scale_ub is not None: 2025-05-07T20:32:37.0876552Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:37.0876686Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:37.0876770Z ) 2025-05-07T20:32:37.0876847Z else: 2025-05-07T20:32:37.0876945Z scale_ub_tensor = None 2025-05-07T20:32:37.0877018Z 2025-05-07T20:32:37.0877145Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:37.0877243Z op = silu_mul_quant 2025-05-07T20:32:37.0877374Z if compiled: 2025-05-07T20:32:37.0877476Z op = torch.compile(op) 2025-05-07T20:32:37.0877587Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:37.0877663Z 2025-05-07T20:32:37.0877753Z > y_fp8, y_scale = fn() 2025-05-07T20:32:37.0877758Z 2025-05-07T20:32:37.0877859Z moe/activation_test.py:117: 2025-05-07T20:32:37.0877988Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:37.0878087Z moe/activation_test.py:115: in fn 2025-05-07T20:32:37.0878192Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:37.0878685Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:32:37.0878787Z 
_fbgemm_silu_mul_quant[grid](
/home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/jit.py:330: in <lambda>
    return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs)
/home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/jit.py:623: in run
    kernel = self.compile(
/home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:273: in compile
    module = src.make_ir(options, codegen_fns, module_map, context)
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _

self = <triton.compiler.compiler.ASTSource object at 0x...>
options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0, ...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True)
codegen_fns = {'convert_custom_types': <...>, 'min_dot_size': <... at 0x7f78be546a20>}
module_map = {'triton.language.extra.libdevice': <...>}
context = <...>

    def make_ir(self, options, codegen_fns, module_map, context):
>       return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns,
                           module_map=module_map)
E       triton.compiler.errors.CompilationError: at 1:0:
E       def _fbgemm_silu_mul_quant(
E       ^
E       ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')")

/home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:100: CompilationError

Trying example: test_silu_mul_quant(
    self=<...>,
    T=128,
    D=5120,
    scale_ub=None,
    contiguous=False,
    compiled=True,
)
self = <...>
T = 128, D = 5120, scale_ub = None, contiguous = False, compiled = True

    @given(
        T=st.sampled_from([1, 128, 2048, 4096, 16384]),
        D=st.sampled_from([5120, 7168]),
        scale_ub=st.sampled_from([None, 1200.00]),
        contiguous=st.sampled_from([True, False]),
        compiled=st.sampled_from([True, False]),
    )
    @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None)
    def test_silu_mul_quant(
        self,
        T: int,
        D: int,
        scale_ub: Optional[float],
        contiguous: bool,
        compiled: bool,
    ) -> None:
        torch.manual_seed(2025)

        x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16)

        x_sign = torch.sign(x)
        x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0)
        x = x_sign * x_clamp
        x0 = x[:, :D]
        x1 = x[:, D:]

        if contiguous:
            x0 = x0.contiguous()
            x1 = x1.contiguous()

        if scale_ub is not None:
            scale_ub_tensor = torch.tensor(
                [scale_ub], device="cuda", dtype=torch.float32
            )
        else:
            scale_ub_tensor = None

        def fn() -> Tuple[torch.Tensor, torch.Tensor]:
            op = silu_mul_quant
            if compiled:
                op = torch.compile(op)
            return op(x0, x1, scale_ub_tensor)

>       y_fp8, y_scale = fn()

moe/activation_test.py:117:
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _
moe/activation_test.py:115: in fn
    return op(x0, x1, scale_ub_tensor)
/home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/torch/_dynamo/eval_frame.py:678: in _fn
    return fn(*args, **kwargs)
/home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant
    _fbgemm_silu_mul_quant[grid](
    (same Triton compile chain as above)
E   triton.compiler.errors.CompilationError: at 1:0:
E   def _fbgemm_silu_mul_quant(
E   ^
E   ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')")

/home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:100: CompilationError
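Root cause note: the failure is architectural, not shape-dependent. Triton's fp8e4nv is NVIDIA's FP8 E4M3 format, which Triton compiles only for GPUs with compute capability 8.9 or newer (Ada/Hopper); this job runs on a linux.g5.4xlarge runner whose A10G GPU is SM 8.6, where only fp8e4b15 and fp8e5 exist, exactly as the ValueError reports. A minimal capability guard for such tests could look like the sketch below (an illustration under those assumptions, not the decorator the test actually uses; sm89_or_later and requires_fp8 are hypothetical names):

import unittest
import torch

def sm89_or_later() -> bool:
    # FP8 E4M3 ("fp8e4nv" in Triton) needs SM 8.9+; Ada reports (8, 9),
    # Hopper (9, 0), while the A10G on g5 instances reports (8, 6).
    return torch.cuda.is_available() and torch.cuda.get_device_capability() >= (8, 9)

# Apply to FP8 tests so unsupported runners skip instead of erroring out.
requires_fp8 = unittest.skipUnless(
    sm89_or_later(), "Triton fp8e4nv (FP8 E4M3) requires compute capability >= 8.9"
)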
Hypothesis keeps drawing examples; each one re-prints the test source above and fails at `y_fp8, y_scale = fn()` with the identical fp8e4nv CompilationError in _fbgemm_silu_mul_quant (gen_ai/moe/activation.py:80). Only the drawn parameters differ:

Trying example: test_silu_mul_quant(T=128, D=7168, scale_ub=1200.0, contiguous=False, compiled=False) -> same CompilationError
Trying example: test_silu_mul_quant(T=128, D=5120, scale_ub=None, contiguous=False, compiled=False) -> same CompilationError
Trying example: test_silu_mul_quant(T=128, D=5120, scale_ub=1200.0, contiguous=True, compiled=False) -> same CompilationError
Trying example: test_silu_mul_quant(T=1, D=7168, scale_ub=1200.0, contiguous=True, compiled=True) -> same CompilationError
Trying example: test_silu_mul_quant(T=1, D=7168, scale_ub=1200.0, contiguous=False, compiled=True) -> same CompilationError
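Because the ValueError is raised while Triton lowers the kernel's AST, before any launch, every combination of T, D, scale_ub, contiguous, and compiled reaches the same compile-time failure; the shapes never matter. A deterministic standalone reproducer sketch, assuming the import path shown in the traceback:

import torch
from fbgemm_gpu.experimental.gen_ai.moe.activation import silu_mul_quant

x0 = torch.randn(1, 5120, device="cuda", dtype=torch.bfloat16)
x1 = torch.randn(1, 5120, device="cuda", dtype=torch.bfloat16)
silu_mul_quant(x0, x1, None)  # raises CompilationError on SM < 8.9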
Trying example: test_silu_mul_quant(
    self=<...>,
    T=1,
    D=7168,
    scale_ub=None,
    contiguous=False,
    compiled=True,
)
self = <...>
T = 1, D = 7168, scale_ub = None, contiguous = False, compiled = True

This example gets further: the source listing continues past fn(), and the failure surfaces in the fp32 reference path instead, whose quantization kernel trips over the same unsupported dtype:

        y_fp8, y_scale = fn()
        y = y_fp8.to(torch.float32) * y_scale[:, None]

        def ref_fn() -> Tuple[torch.Tensor, torch.Tensor]:
            x0_fp32 = x0.to(torch.float32)
            x1_fp32 = x1.to(torch.float32)
            y = x0_fp32 * torch.sigmoid(x0_fp32) * x1_fp32
            return triton_quantize_fp8_row(y, scale_ub_tensor)

>       y_fp8_ref, y_scale_ref = ref_fn()

moe/activation_test.py:126:
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _
moe/activation_test.py:124: in ref_fn
    return triton_quantize_fp8_row(y, scale_ub_tensor)
/home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/fbgemm_gpu/experimental/gemm/triton_gemm/fp8_gemm.py:2370: in triton_quantize_fp8_row
    _kernel_quantize_fp8_row[grid](
/home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/jit.py:330: in <lambda>
    return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs)
/home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/autotuner.py:186: in run
    timings = {config: self._bench(*args, config=config, **kwargs) for config in pruned_configs}
/home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/autotuner.py:166: in _bench
    return self.do_bench(kernel_call, quantiles=(0.5, 0.2, 0.8))
/home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/testing.py:117: in do_bench
    fn()
/home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/autotuner.py:152: in kernel_call
    self.fn.run(
/home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/jit.py:623: in run
    kernel = self.compile(
/home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:273: in compile
    module = src.make_ir(options, codegen_fns, module_map, context)
    (make_ir locals as above, here with num_stages=2)
E   triton.compiler.errors.CompilationError: at 1:0:
E   def _kernel_quantize_fp8_row(
E   ^
E   ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')")

/home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:100: CompilationError
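For reference, ref_fn computes y = x0 * sigmoid(x0) * x1 in fp32 and then row-quantizes it; triton_quantize_fp8_row compiles the same fp8e4nv cast, so it fails on this GPU for the same reason. The rowwise scheme itself is simple; a pure-PyTorch sketch (assumptions: torch.float8_e4m3fn is available, E4M3's finite max is 448.0, and scale_ub caps the per-row max; quantize_fp8_row_ref is a hypothetical name, not the FBGEMM kernel):

import torch

FP8_E4M3_MAX = 448.0  # largest finite value representable in E4M3

def quantize_fp8_row_ref(y: torch.Tensor, scale_ub: torch.Tensor | None = None):
    # Per-row max-abs sets the dequantization scale so each row spans
    # the representable FP8 range; clamp avoids division by zero.
    row_max = y.abs().amax(dim=-1).to(torch.float32).clamp(min=1e-12)
    if scale_ub is not None:
        row_max = torch.minimum(row_max, scale_ub)  # cap outlier rows
    y_scale = row_max / FP8_E4M3_MAX
    y_fp8 = (y / y_scale[:, None]).to(torch.float8_e4m3fn)
    return y_fp8, y_scale  # dequantize: y_fp8.float() * y_scale[:, None]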
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:37.0982842Z 2025-05-07T20:32:37.0983256Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:37.0983260Z 2025-05-07T20:32:37.0983363Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:37.0983589Z self=, 2025-05-07T20:32:37.0983673Z T=1, 2025-05-07T20:32:37.0983750Z D=5120, 2025-05-07T20:32:37.0983835Z scale_ub=1200.0, 2025-05-07T20:32:37.0983968Z contiguous=False, 2025-05-07T20:32:37.0984052Z compiled=True, 2025-05-07T20:32:37.0984126Z ) 2025-05-07T20:32:37.0984346Z self = 2025-05-07T20:32:37.0984509Z T = 1, D = 5120, scale_ub = 1200.0, contiguous = False, compiled = True 2025-05-07T20:32:37.0984514Z 2025-05-07T20:32:37.0984597Z @given( 2025-05-07T20:32:37.0984714Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:37.0984817Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:37.0984934Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:37.0985050Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:37.0985163Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:37.0985286Z ) 2025-05-07T20:32:37.0985533Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:37.0985629Z def test_silu_mul_quant( 2025-05-07T20:32:37.0985703Z self, 2025-05-07T20:32:37.0985783Z T: int, 2025-05-07T20:32:37.0985863Z D: int, 2025-05-07T20:32:37.0985962Z scale_ub: Optional[float], 2025-05-07T20:32:37.0986049Z contiguous: bool, 2025-05-07T20:32:37.0986134Z compiled: bool, 2025-05-07T20:32:37.0986212Z ) -> None: 2025-05-07T20:32:37.0986305Z torch.manual_seed(2025) 2025-05-07T20:32:37.0986382Z 2025-05-07T20:32:37.0986549Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:37.0986625Z 2025-05-07T20:32:37.0986719Z x_sign = torch.sign(x) 2025-05-07T20:32:37.0986844Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:37.0986935Z x = x_sign * x_clamp 2025-05-07T20:32:37.0987017Z x0 = x[:, :D] 2025-05-07T20:32:37.0987098Z x1 = x[:, D:] 2025-05-07T20:32:37.0987243Z 2025-05-07T20:32:37.0987328Z if contiguous: 2025-05-07T20:32:37.0987417Z x0 = x0.contiguous() 2025-05-07T20:32:37.0987515Z x1 = x1.contiguous() 2025-05-07T20:32:37.0987595Z 2025-05-07T20:32:37.0987684Z if scale_ub is not None: 2025-05-07T20:32:37.0987793Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:37.0987927Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:37.0988002Z ) 2025-05-07T20:32:37.0988086Z else: 2025-05-07T20:32:37.0988179Z scale_ub_tensor = None 2025-05-07T20:32:37.0988255Z 2025-05-07T20:32:37.0988392Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:37.0988481Z op = silu_mul_quant 2025-05-07T20:32:37.0988571Z if compiled: 2025-05-07T20:32:37.0988670Z op = torch.compile(op) 2025-05-07T20:32:37.0988779Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:37.0988855Z 2025-05-07T20:32:37.0988948Z > y_fp8, y_scale = fn() 2025-05-07T20:32:37.0988953Z 2025-05-07T20:32:37.0989049Z moe/activation_test.py:117: 2025-05-07T20:32:37.0989187Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:37.0989286Z moe/activation_test.py:115: in fn 2025-05-07T20:32:37.0989384Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:37.0989750Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/torch/_dynamo/eval_frame.py:678: in _fn 2025-05-07T20:32:37.0989844Z return fn(*args, **kwargs) 
2025-05-07T20:32:37.0990335Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:32:37.0990477Z _fbgemm_silu_mul_quant[grid]( 2025-05-07T20:32:37.0990836Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:37.0991064Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:37.0991400Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:37.0991538Z kernel = self.compile( 2025-05-07T20:32:37.0991917Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:37.0992095Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:37.0992225Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:37.0992229Z 2025-05-07T20:32:37.0992435Z self = 2025-05-07T20:32:37.0993244Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:37.0993744Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f779dc29300>} 2025-05-07T20:32:37.0994481Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:37.0994667Z context = 2025-05-07T20:32:37.0994672Z 2025-05-07T20:32:37.0994835Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:37.0995099Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:37.0995204Z module_map=module_map) 2025-05-07T20:32:37.0995394Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:37.0995507Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:32:37.0995643Z E ^ 2025-05-07T20:32:37.0996001Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:37.0996009Z 2025-05-07T20:32:37.0996413Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:37.0996418Z 2025-05-07T20:32:37.0996523Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:37.0996744Z self=, 2025-05-07T20:32:37.0996820Z T=1, 2025-05-07T20:32:37.0996901Z D=5120, 2025-05-07T20:32:37.0996988Z scale_ub=1200.0, 2025-05-07T20:32:37.0997074Z contiguous=False, 2025-05-07T20:32:37.0997161Z compiled=False, 2025-05-07T20:32:37.0997234Z ) 2025-05-07T20:32:37.0997449Z self = 2025-05-07T20:32:37.0997619Z T = 1, D = 5120, scale_ub = 1200.0, contiguous = False, compiled = False 2025-05-07T20:32:37.0997630Z 2025-05-07T20:32:37.0997707Z @given( 2025-05-07T20:32:37.0997830Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:37.0997934Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:37.0998048Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:37.0998363Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:37.0998530Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:37.0998608Z ) 2025-05-07T20:32:37.0998853Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:37.0999038Z def test_silu_mul_quant( 2025-05-07T20:32:37.0999116Z self, 2025-05-07T20:32:37.0999199Z T: int, 2025-05-07T20:32:37.0999274Z D: int, 2025-05-07T20:32:37.0999372Z scale_ub: Optional[float], 2025-05-07T20:32:37.0999462Z contiguous: bool, 2025-05-07T20:32:37.0999551Z compiled: bool, 2025-05-07T20:32:37.0999633Z ) -> None: 2025-05-07T20:32:37.0999728Z torch.manual_seed(2025) 2025-05-07T20:32:37.0999800Z 2025-05-07T20:32:37.0999971Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:37.1000109Z 2025-05-07T20:32:37.1000200Z x_sign = torch.sign(x) 2025-05-07T20:32:37.1000326Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:37.1000412Z x = x_sign * x_clamp 2025-05-07T20:32:37.1000492Z x0 = x[:, :D] 2025-05-07T20:32:37.1000575Z x1 = x[:, D:] 2025-05-07T20:32:37.1000647Z 2025-05-07T20:32:37.1000729Z if contiguous: 2025-05-07T20:32:37.1000826Z x0 = x0.contiguous() 2025-05-07T20:32:37.1000915Z x1 = x1.contiguous() 2025-05-07T20:32:37.1000987Z 2025-05-07T20:32:37.1001081Z if scale_ub is not None: 2025-05-07T20:32:37.1001185Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:37.1001388Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:37.1001468Z ) 2025-05-07T20:32:37.1001544Z else: 2025-05-07T20:32:37.1001642Z scale_ub_tensor = None 2025-05-07T20:32:37.1001712Z 2025-05-07T20:32:37.1001842Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:37.1001937Z op = silu_mul_quant 2025-05-07T20:32:37.1002024Z if compiled: 2025-05-07T20:32:37.1002123Z op = torch.compile(op) 2025-05-07T20:32:37.1002230Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:37.1002300Z 2025-05-07T20:32:37.1002390Z > y_fp8, y_scale = fn() 2025-05-07T20:32:37.1002398Z 2025-05-07T20:32:37.1002497Z moe/activation_test.py:117: 2025-05-07T20:32:37.1002628Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:37.1002728Z moe/activation_test.py:115: in fn 2025-05-07T20:32:37.1002826Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:37.1003379Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:32:37.1003482Z 
_fbgemm_silu_mul_quant[grid]( 2025-05-07T20:32:37.1003840Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:37.1004062Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:37.1004398Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:37.1004490Z kernel = self.compile( 2025-05-07T20:32:37.1004870Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:37.1005047Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:37.1005176Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:37.1005184Z 2025-05-07T20:32:37.1005408Z self = 2025-05-07T20:32:37.1006198Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:37.1006702Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f779dc2a020>} 2025-05-07T20:32:37.1007434Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:37.1007668Z context = 2025-05-07T20:32:37.1007673Z 2025-05-07T20:32:37.1007837Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:37.1008097Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:37.1008212Z module_map=module_map) 2025-05-07T20:32:37.1008412Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:37.1008509Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:32:37.1008590Z E ^ 2025-05-07T20:32:37.1008936Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:37.1008941Z 2025-05-07T20:32:37.1009351Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:37.1009358Z 2025-05-07T20:32:37.1009458Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:37.1009677Z self=, 2025-05-07T20:32:37.1009759Z T=16384, 2025-05-07T20:32:37.1009873Z D=5120, 2025-05-07T20:32:37.1009957Z scale_ub=1200.0, 2025-05-07T20:32:37.1010052Z contiguous=False, 2025-05-07T20:32:37.1010136Z compiled=True, 2025-05-07T20:32:37.1010214Z ) 2025-05-07T20:32:37.1010433Z self = 2025-05-07T20:32:37.1010609Z T = 16384, D = 5120, scale_ub = 1200.0, contiguous = False, compiled = True 2025-05-07T20:32:37.1010613Z 2025-05-07T20:32:37.1010692Z @given( 2025-05-07T20:32:37.1010812Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:37.1010911Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:37.1011028Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:37.1011146Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:37.1011264Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:37.1011338Z ) 2025-05-07T20:32:37.1011581Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:37.1011676Z def test_silu_mul_quant( 2025-05-07T20:32:37.1011794Z self, 2025-05-07T20:32:37.1011871Z T: int, 2025-05-07T20:32:37.1011951Z D: int, 2025-05-07T20:32:37.1012048Z scale_ub: Optional[float], 2025-05-07T20:32:37.1012142Z contiguous: bool, 2025-05-07T20:32:37.1012232Z compiled: bool, 2025-05-07T20:32:37.1012309Z ) -> None: 2025-05-07T20:32:37.1012402Z torch.manual_seed(2025) 2025-05-07T20:32:37.1012476Z 2025-05-07T20:32:37.1012641Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:37.1012715Z 2025-05-07T20:32:37.1012814Z x_sign = torch.sign(x) 2025-05-07T20:32:37.1012942Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:37.1013034Z x = x_sign * x_clamp 2025-05-07T20:32:37.1013115Z x0 = x[:, :D] 2025-05-07T20:32:37.1013195Z x1 = x[:, D:] 2025-05-07T20:32:37.1013267Z 2025-05-07T20:32:37.1013354Z if contiguous: 2025-05-07T20:32:37.1013447Z x0 = x0.contiguous() 2025-05-07T20:32:37.1013544Z x1 = x1.contiguous() 2025-05-07T20:32:37.1013615Z 2025-05-07T20:32:37.1013810Z if scale_ub is not None: 2025-05-07T20:32:37.1013926Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:37.1014060Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:37.1014134Z ) 2025-05-07T20:32:37.1014213Z else: 2025-05-07T20:32:37.1014305Z scale_ub_tensor = None 2025-05-07T20:32:37.1014382Z 2025-05-07T20:32:37.1014510Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:37.1014598Z op = silu_mul_quant 2025-05-07T20:32:37.1014734Z if compiled: 2025-05-07T20:32:37.1014833Z op = torch.compile(op) 2025-05-07T20:32:37.1014940Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:37.1015017Z 2025-05-07T20:32:37.1015110Z > y_fp8, y_scale = fn() 2025-05-07T20:32:37.1015116Z 2025-05-07T20:32:37.1015212Z moe/activation_test.py:117: 2025-05-07T20:32:37.1015346Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:37.1015449Z moe/activation_test.py:115: in fn 2025-05-07T20:32:37.1015611Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:37.1015972Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/torch/_dynamo/eval_frame.py:678: in _fn 2025-05-07T20:32:37.1016065Z return fn(*args, **kwargs) 
/home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant
    _fbgemm_silu_mul_quant[grid](
/home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/jit.py:330: in <lambda>
    return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs)
/home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/jit.py:623: in run
    kernel = self.compile(
/home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:273: in compile
    module = src.make_ir(options, codegen_fns, module_map, context)
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _

self =
options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True)
codegen_fns = {'convert_custom_types': , 'min_dot_size': }
module_map = {'triton.language.extra.libdevice': }
context =

    def make_ir(self, options, codegen_fns, module_map, context):
>       return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns,
                           module_map=module_map)
E       triton.compiler.errors.CompilationError: at 1:0:
E       def _fbgemm_silu_mul_quant(
E       ^
E       ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')")

/home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:100: CompilationError

Trying example: test_silu_mul_quant(
    self=,
    T=2048,
    D=7168,
    scale_ub=1200.0,
    contiguous=False,
    compiled=True,
)
self =
T = 2048, D = 7168, scale_ub = 1200.0, contiguous = False, compiled = True

    @given(
        T=st.sampled_from([1, 128, 2048, 4096, 16384]),
        D=st.sampled_from([5120, 7168]),
        scale_ub=st.sampled_from([None, 1200.00]),
        contiguous=st.sampled_from([True, False]),
        compiled=st.sampled_from([True, False]),
    )
    @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None)
    def test_silu_mul_quant(
        self,
        T: int,
        D: int,
        scale_ub: Optional[float],
        contiguous: bool,
        compiled: bool,
    ) -> None:
        torch.manual_seed(2025)

        x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16)

        x_sign = torch.sign(x)
        x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0)
        x = x_sign * x_clamp
        x0 = x[:, :D]
        x1 = x[:, D:]

        if contiguous:
            x0 = x0.contiguous()
            x1 = x1.contiguous()

        if scale_ub is not None:
            scale_ub_tensor = torch.tensor(
                [scale_ub], device="cuda", dtype=torch.float32
            )
        else:
            scale_ub_tensor = None

        def fn() -> Tuple[torch.Tensor, torch.Tensor]:
            op = silu_mul_quant
            if compiled:
                op = torch.compile(op)
            return op(x0, x1, scale_ub_tensor)

>       y_fp8, y_scale = fn()

moe/activation_test.py:117:
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _
moe/activation_test.py:115: in fn
    return op(x0, x1, scale_ub_tensor)
/home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/torch/_dynamo/eval_frame.py:678: in _fn
    return fn(*args, **kwargs)
/home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant
    _fbgemm_silu_mul_quant[grid](
/home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/jit.py:330: in <lambda>
    return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs)
/home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/jit.py:623: in run
    kernel = self.compile(
/home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:273: in compile
    module = src.make_ir(options, codegen_fns, module_map, context)
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _

self =
options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True)
codegen_fns = {'convert_custom_types': , 'min_dot_size': }
module_map = {'triton.language.extra.libdevice': }
context =

    def make_ir(self, options, codegen_fns, module_map, context):
>       return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns,
                           module_map=module_map)
E       triton.compiler.errors.CompilationError: at 1:0:
E       def _fbgemm_silu_mul_quant(
E       ^
E       ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')")

/home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:100: CompilationError
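Note: every repetition of this failure, above and below, is the same compile-time error rather than a numerical mismatch. The job runs on a linux.g5.4xlarge.nvidia.gpu runner (NVIDIA A10G, compute capability 8.6), while Triton's fp8e4nv type (torch.float8_e4m3fn) generally requires compute capability 8.9 or newer, so _fbgemm_silu_mul_quant cannot be lowered on this GPU. Below is a minimal sketch of a device-capability guard that would skip these examples on unsupported hardware; the helper name, the 8.9 threshold, and the test-class name are illustrative assumptions, not FBGEMM code:

    import unittest

    import torch


    def supports_fp8e4nv() -> bool:
        # Assumed threshold for this sketch: fp8e4nv (e4m3) lowering is
        # available on SM 8.9+ (e.g. L4, L40S, H100); the A10G on this
        # runner reports (8, 6).
        if not torch.cuda.is_available():
            return False
        return torch.cuda.get_device_capability() >= (8, 9)


    @unittest.skipIf(not supports_fp8e4nv(), "fp8e4nv requires compute capability >= 8.9")
    class ActivationTests(unittest.TestCase):
        ...  # the Hypothesis-driven test_silu_mul_quant body from this log

With such a guard the run would report skips instead of repeating the identical CompilationError for every sampled (T, D, scale_ub, contiguous, compiled) combination.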
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:37.1035093Z 2025-05-07T20:32:37.1035524Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:37.1035528Z 2025-05-07T20:32:37.1035650Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:37.1035879Z self=, 2025-05-07T20:32:37.1035954Z T=1, 2025-05-07T20:32:37.1036072Z D=5120, 2025-05-07T20:32:37.1036154Z scale_ub=None, 2025-05-07T20:32:37.1036241Z contiguous=False, 2025-05-07T20:32:37.1036329Z compiled=False, 2025-05-07T20:32:37.1036402Z ) 2025-05-07T20:32:37.1036617Z self = 2025-05-07T20:32:37.1036784Z T = 1, D = 5120, scale_ub = None, contiguous = False, compiled = False 2025-05-07T20:32:37.1036789Z 2025-05-07T20:32:37.1036863Z @given( 2025-05-07T20:32:37.1036985Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:37.1037085Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:37.1037200Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:37.1037320Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:37.1037434Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:37.1037508Z ) 2025-05-07T20:32:37.1037756Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:37.1037849Z def test_silu_mul_quant( 2025-05-07T20:32:37.1037928Z self, 2025-05-07T20:32:37.1038007Z T: int, 2025-05-07T20:32:37.1038085Z D: int, 2025-05-07T20:32:37.1038183Z scale_ub: Optional[float], 2025-05-07T20:32:37.1038274Z contiguous: bool, 2025-05-07T20:32:37.1038361Z compiled: bool, 2025-05-07T20:32:37.1038438Z ) -> None: 2025-05-07T20:32:37.1038536Z torch.manual_seed(2025) 2025-05-07T20:32:37.1038609Z 2025-05-07T20:32:37.1038774Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:37.1038893Z 2025-05-07T20:32:37.1038983Z x_sign = torch.sign(x) 2025-05-07T20:32:37.1039105Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:37.1039196Z x = x_sign * x_clamp 2025-05-07T20:32:37.1039275Z x0 = x[:, :D] 2025-05-07T20:32:37.1039356Z x1 = x[:, D:] 2025-05-07T20:32:37.1039433Z 2025-05-07T20:32:37.1039515Z if contiguous: 2025-05-07T20:32:37.1039605Z x0 = x0.contiguous() 2025-05-07T20:32:37.1039697Z x1 = x1.contiguous() 2025-05-07T20:32:37.1039767Z 2025-05-07T20:32:37.1039901Z if scale_ub is not None: 2025-05-07T20:32:37.1040009Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:37.1040142Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:37.1040221Z ) 2025-05-07T20:32:37.1040295Z else: 2025-05-07T20:32:37.1040387Z scale_ub_tensor = None 2025-05-07T20:32:37.1040461Z 2025-05-07T20:32:37.1040587Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:37.1040679Z op = silu_mul_quant 2025-05-07T20:32:37.1040766Z if compiled: 2025-05-07T20:32:37.1040863Z op = torch.compile(op) 2025-05-07T20:32:37.1040968Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:37.1041041Z 2025-05-07T20:32:37.1041173Z > y_fp8, y_scale = fn() 2025-05-07T20:32:37.1041180Z 2025-05-07T20:32:37.1041281Z moe/activation_test.py:117: 2025-05-07T20:32:37.1041410Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:37.1041510Z moe/activation_test.py:115: in fn 2025-05-07T20:32:37.1041613Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:37.1042102Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:32:37.1042199Z 
_fbgemm_silu_mul_quant[grid]( 2025-05-07T20:32:37.1042559Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:37.1042780Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:37.1043118Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:37.1043213Z kernel = self.compile( 2025-05-07T20:32:37.1043633Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:37.1043808Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:37.1043937Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:37.1043941Z 2025-05-07T20:32:37.1044143Z self = 2025-05-07T20:32:37.1044907Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:37.1045404Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f779d609120>} 2025-05-07T20:32:37.1046192Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:37.1046381Z context = 2025-05-07T20:32:37.1046385Z 2025-05-07T20:32:37.1046550Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:37.1046806Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:37.1046912Z module_map=module_map) 2025-05-07T20:32:37.1047076Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:37.1047213Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:32:37.1047288Z E ^ 2025-05-07T20:32:37.1047638Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:37.1047645Z 2025-05-07T20:32:37.1048054Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:37.1048058Z 2025-05-07T20:32:37.1048165Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:37.1048449Z self=, 2025-05-07T20:32:37.1048526Z T=4096, 2025-05-07T20:32:37.1048607Z D=7168, 2025-05-07T20:32:37.1048690Z scale_ub=1200.0, 2025-05-07T20:32:37.1048779Z contiguous=False, 2025-05-07T20:32:37.1048867Z compiled=False, 2025-05-07T20:32:37.1048939Z ) 2025-05-07T20:32:37.1049156Z self = 2025-05-07T20:32:37.1049331Z T = 4096, D = 7168, scale_ub = 1200.0, contiguous = False, compiled = False 2025-05-07T20:32:37.1049335Z 2025-05-07T20:32:37.1049413Z @given( 2025-05-07T20:32:37.1049536Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:37.1049675Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:37.1049792Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:37.1049912Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:37.1050025Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:37.1050105Z ) 2025-05-07T20:32:37.1050346Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:37.1050438Z def test_silu_mul_quant( 2025-05-07T20:32:37.1050518Z self, 2025-05-07T20:32:37.1050592Z T: int, 2025-05-07T20:32:37.1050669Z D: int, 2025-05-07T20:32:37.1050767Z scale_ub: Optional[float], 2025-05-07T20:32:37.1050859Z contiguous: bool, 2025-05-07T20:32:37.1050944Z compiled: bool, 2025-05-07T20:32:37.1051027Z ) -> None: 2025-05-07T20:32:37.1051119Z torch.manual_seed(2025) 2025-05-07T20:32:37.1051190Z 2025-05-07T20:32:37.1051364Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:37.1051435Z 2025-05-07T20:32:37.1051568Z x_sign = torch.sign(x) 2025-05-07T20:32:37.1051695Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:37.1051781Z x = x_sign * x_clamp 2025-05-07T20:32:37.1051868Z x0 = x[:, :D] 2025-05-07T20:32:37.1051946Z x1 = x[:, D:] 2025-05-07T20:32:37.1052017Z 2025-05-07T20:32:37.1052103Z if contiguous: 2025-05-07T20:32:37.1052192Z x0 = x0.contiguous() 2025-05-07T20:32:37.1052281Z x1 = x1.contiguous() 2025-05-07T20:32:37.1052355Z 2025-05-07T20:32:37.1052445Z if scale_ub is not None: 2025-05-07T20:32:37.1052550Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:37.1052690Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:37.1052764Z ) 2025-05-07T20:32:37.1052838Z else: 2025-05-07T20:32:37.1052934Z scale_ub_tensor = None 2025-05-07T20:32:37.1053005Z 2025-05-07T20:32:37.1053138Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:37.1053229Z op = silu_mul_quant 2025-05-07T20:32:37.1053311Z if compiled: 2025-05-07T20:32:37.1053414Z op = torch.compile(op) 2025-05-07T20:32:37.1053522Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:37.1053593Z 2025-05-07T20:32:37.1053779Z > y_fp8, y_scale = fn() 2025-05-07T20:32:37.1053784Z 2025-05-07T20:32:37.1053878Z moe/activation_test.py:117: 2025-05-07T20:32:37.1054004Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:37.1054105Z moe/activation_test.py:115: in fn 2025-05-07T20:32:37.1054204Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:37.1054746Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 
2025-05-07T20:32:37.1054841Z _fbgemm_silu_mul_quant[grid]( 2025-05-07T20:32:37.1055199Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:37.1055421Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:37.1055753Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:37.1055884Z kernel = self.compile( 2025-05-07T20:32:37.1056264Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:37.1056438Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:37.1056567Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:37.1056574Z 2025-05-07T20:32:37.1056775Z self = 2025-05-07T20:32:37.1057576Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:37.1058073Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f779d60a480>} 2025-05-07T20:32:37.1058805Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:37.1058995Z context = 2025-05-07T20:32:37.1059002Z 2025-05-07T20:32:37.1059162Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:37.1059422Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:37.1059527Z module_map=module_map) 2025-05-07T20:32:37.1059687Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:37.1059829Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:32:37.1059904Z E ^ 2025-05-07T20:32:37.1060251Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:37.1060257Z 2025-05-07T20:32:37.1060663Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:37.1060667Z 2025-05-07T20:32:37.1060768Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:37.1060987Z self=, 2025-05-07T20:32:37.1061065Z T=16384, 2025-05-07T20:32:37.1061140Z D=7168, 2025-05-07T20:32:37.1061222Z scale_ub=None, 2025-05-07T20:32:37.1061306Z contiguous=True, 2025-05-07T20:32:37.1061388Z compiled=True, 2025-05-07T20:32:37.1061462Z ) 2025-05-07T20:32:37.1061680Z self = 2025-05-07T20:32:37.1061851Z T = 16384, D = 7168, scale_ub = None, contiguous = True, compiled = True 2025-05-07T20:32:37.1061859Z 2025-05-07T20:32:37.1061933Z @given( 2025-05-07T20:32:37.1062056Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:37.1062156Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:37.1062270Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:37.1062384Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:37.1062498Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:37.1062572Z ) 2025-05-07T20:32:37.1062813Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:37.1062951Z def test_silu_mul_quant( 2025-05-07T20:32:37.1063025Z self, 2025-05-07T20:32:37.1063101Z T: int, 2025-05-07T20:32:37.1063179Z D: int, 2025-05-07T20:32:37.1063277Z scale_ub: Optional[float], 2025-05-07T20:32:37.1063371Z contiguous: bool, 2025-05-07T20:32:37.1063457Z compiled: bool, 2025-05-07T20:32:37.1063534Z ) -> None: 2025-05-07T20:32:37.1063628Z torch.manual_seed(2025) 2025-05-07T20:32:37.1063739Z 2025-05-07T20:32:37.1063902Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:37.1063976Z 2025-05-07T20:32:37.1064065Z x_sign = torch.sign(x) 2025-05-07T20:32:37.1064186Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:37.1064276Z x = x_sign * x_clamp 2025-05-07T20:32:37.1064354Z x0 = x[:, :D] 2025-05-07T20:32:37.1064432Z x1 = x[:, D:] 2025-05-07T20:32:37.1064510Z 2025-05-07T20:32:37.1064590Z if contiguous: 2025-05-07T20:32:37.1064679Z x0 = x0.contiguous() 2025-05-07T20:32:37.1064771Z x1 = x1.contiguous() 2025-05-07T20:32:37.1064841Z 2025-05-07T20:32:37.1064932Z if scale_ub is not None: 2025-05-07T20:32:37.1065078Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:37.1065213Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:37.1065290Z ) 2025-05-07T20:32:37.1065365Z else: 2025-05-07T20:32:37.1065461Z scale_ub_tensor = None 2025-05-07T20:32:37.1065536Z 2025-05-07T20:32:37.1065662Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:37.1065750Z op = silu_mul_quant 2025-05-07T20:32:37.1065835Z if compiled: 2025-05-07T20:32:37.1065932Z op = torch.compile(op) 2025-05-07T20:32:37.1066039Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:37.1066113Z 2025-05-07T20:32:37.1066205Z > y_fp8, y_scale = fn() 2025-05-07T20:32:37.1066210Z 2025-05-07T20:32:37.1066308Z moe/activation_test.py:117: 2025-05-07T20:32:37.1066437Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:37.1066534Z moe/activation_test.py:115: in fn 2025-05-07T20:32:37.1066637Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:37.1067041Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/torch/_dynamo/eval_frame.py:678: in _fn 2025-05-07T20:32:37.1067135Z return fn(*args, **kwargs) 
2025-05-07T20:32:37.1067626Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:32:37.1067723Z _fbgemm_silu_mul_quant[grid]( 2025-05-07T20:32:37.1068080Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:37.1068299Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:37.1068635Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:37.1068728Z kernel = self.compile( 2025-05-07T20:32:37.1069107Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:37.1069285Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:37.1069411Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:37.1069421Z 2025-05-07T20:32:37.1069622Z self = 2025-05-07T20:32:37.1070383Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:37.1070918Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f779d60b740>} 2025-05-07T20:32:37.1071657Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:37.1071844Z context = 2025-05-07T20:32:37.1071848Z 2025-05-07T20:32:37.1072050Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:37.1072308Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:37.1072415Z module_map=module_map) 2025-05-07T20:32:37.1072577Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:37.1072672Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:32:37.1072750Z E ^ 2025-05-07T20:32:37.1073098Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:37.1073102Z 2025-05-07T20:32:37.1073545Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:37.1073550Z 2025-05-07T20:32:37.1073657Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:37.1073874Z self=, 2025-05-07T20:32:37.1073952Z T=4096, 2025-05-07T20:32:37.1074030Z D=5120, 2025-05-07T20:32:37.1074110Z scale_ub=None, 2025-05-07T20:32:37.1074197Z contiguous=False, 2025-05-07T20:32:37.1074281Z compiled=True, 2025-05-07T20:32:37.1074352Z ) 2025-05-07T20:32:37.1074567Z self = 2025-05-07T20:32:37.1074738Z T = 4096, D = 5120, scale_ub = None, contiguous = False, compiled = True 2025-05-07T20:32:37.1074745Z 2025-05-07T20:32:37.1074819Z @given( 2025-05-07T20:32:37.1074940Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:37.1075039Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:37.1075151Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:37.1075272Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:37.1075424Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:37.1075507Z ) 2025-05-07T20:32:37.1075790Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:37.1075885Z def test_silu_mul_quant( 2025-05-07T20:32:37.1075961Z self, 2025-05-07T20:32:37.1076041Z T: int, 2025-05-07T20:32:37.1076115Z D: int, 2025-05-07T20:32:37.1076211Z scale_ub: Optional[float], 2025-05-07T20:32:37.1076303Z contiguous: bool, 2025-05-07T20:32:37.1076386Z compiled: bool, 2025-05-07T20:32:37.1076469Z ) -> None: 2025-05-07T20:32:37.1076566Z torch.manual_seed(2025) 2025-05-07T20:32:37.1076638Z 2025-05-07T20:32:37.1076806Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:37.1076876Z 2025-05-07T20:32:37.1076967Z x_sign = torch.sign(x) 2025-05-07T20:32:37.1077094Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:37.1077183Z x = x_sign * x_clamp 2025-05-07T20:32:37.1077261Z x0 = x[:, :D] 2025-05-07T20:32:37.1077348Z x1 = x[:, D:] 2025-05-07T20:32:37.1077423Z 2025-05-07T20:32:37.1077504Z if contiguous: 2025-05-07T20:32:37.1077598Z x0 = x0.contiguous() 2025-05-07T20:32:37.1077685Z x1 = x1.contiguous() 2025-05-07T20:32:37.1077755Z 2025-05-07T20:32:37.1077846Z if scale_ub is not None: 2025-05-07T20:32:37.1077950Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:37.1078084Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:37.1078231Z ) 2025-05-07T20:32:37.1078307Z else: 2025-05-07T20:32:37.1078403Z scale_ub_tensor = None 2025-05-07T20:32:37.1078473Z 2025-05-07T20:32:37.1078602Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:37.1078692Z op = silu_mul_quant 2025-05-07T20:32:37.1078778Z if compiled: 2025-05-07T20:32:37.1078881Z op = torch.compile(op) 2025-05-07T20:32:37.1078988Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:37.1079059Z 2025-05-07T20:32:37.1079193Z > y_fp8, y_scale = fn() 2025-05-07T20:32:37.1079202Z 2025-05-07T20:32:37.1079296Z moe/activation_test.py:117: 2025-05-07T20:32:37.1079424Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:37.1079524Z moe/activation_test.py:115: in fn 2025-05-07T20:32:37.1079623Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:37.1079983Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/torch/_dynamo/eval_frame.py:678: in _fn 2025-05-07T20:32:37.1080082Z return fn(*args, **kwargs) 
2025-05-07T20:32:37.1080568Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:32:37.1080705Z _fbgemm_silu_mul_quant[grid]( 2025-05-07T20:32:37.1081067Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:37.1081286Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:37.1081628Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:37.1081720Z kernel = self.compile( 2025-05-07T20:32:37.1082096Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:37.1082272Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:37.1082399Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:37.1082404Z 2025-05-07T20:32:37.1082607Z self = 2025-05-07T20:32:37.1083409Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:37.1083905Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f78bf148c20>} 2025-05-07T20:32:37.1084643Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:37.1084831Z context = 2025-05-07T20:32:37.1084839Z 2025-05-07T20:32:37.1085004Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:37.1085259Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:37.1085369Z module_map=module_map) 2025-05-07T20:32:37.1085560Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:37.1085681Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:32:37.1085760Z E ^ 2025-05-07T20:32:37.1086108Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:37.1086113Z 2025-05-07T20:32:37.1086517Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:37.1086521Z 2025-05-07T20:32:37.1086626Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:37.1086843Z self=, 2025-05-07T20:32:37.1086961Z T=4096, 2025-05-07T20:32:37.1087037Z D=5120, 2025-05-07T20:32:37.1087120Z scale_ub=1200.0, 2025-05-07T20:32:37.1087207Z contiguous=False, 2025-05-07T20:32:37.1087288Z compiled=False, 2025-05-07T20:32:37.1087362Z ) 2025-05-07T20:32:37.1087582Z self = 2025-05-07T20:32:37.1087754Z T = 4096, D = 5120, scale_ub = 1200.0, contiguous = False, compiled = False 2025-05-07T20:32:37.1087799Z 2025-05-07T20:32:37.1087875Z @given( 2025-05-07T20:32:37.1087995Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:37.1088094Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:37.1088211Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:37.1088326Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:37.1088438Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:37.1088514Z ) 2025-05-07T20:32:37.1088755Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:37.1088846Z def test_silu_mul_quant( 2025-05-07T20:32:37.1088924Z self, 2025-05-07T20:32:37.1089000Z T: int, 2025-05-07T20:32:37.1089117Z D: int, 2025-05-07T20:32:37.1089224Z scale_ub: Optional[float], 2025-05-07T20:32:37.1089315Z contiguous: bool, 2025-05-07T20:32:37.1089399Z compiled: bool, 2025-05-07T20:32:37.1089480Z ) -> None: 2025-05-07T20:32:37.1089576Z torch.manual_seed(2025) 2025-05-07T20:32:37.1089652Z 2025-05-07T20:32:37.1089817Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:37.1089887Z 2025-05-07T20:32:37.1089982Z x_sign = torch.sign(x) 2025-05-07T20:32:37.1090105Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:37.1090192Z x = x_sign * x_clamp 2025-05-07T20:32:37.1090272Z x0 = x[:, :D] 2025-05-07T20:32:37.1090353Z x1 = x[:, D:] 2025-05-07T20:32:37.1090425Z 2025-05-07T20:32:37.1090510Z if contiguous: 2025-05-07T20:32:37.1090601Z x0 = x0.contiguous() 2025-05-07T20:32:37.1090688Z x1 = x1.contiguous() 2025-05-07T20:32:37.1090761Z 2025-05-07T20:32:37.1090853Z if scale_ub is not None: 2025-05-07T20:32:37.1090999Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:37.1091137Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:37.1091216Z ) 2025-05-07T20:32:37.1091296Z else: 2025-05-07T20:32:37.1091388Z scale_ub_tensor = None 2025-05-07T20:32:37.1091457Z 2025-05-07T20:32:37.1091587Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:37.1091675Z op = silu_mul_quant 2025-05-07T20:32:37.1091758Z if compiled: 2025-05-07T20:32:37.1091861Z op = torch.compile(op) 2025-05-07T20:32:37.1091964Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:37.1092039Z 2025-05-07T20:32:37.1092130Z > y_fp8, y_scale = fn() 2025-05-07T20:32:37.1092135Z 2025-05-07T20:32:37.1092228Z moe/activation_test.py:117: 2025-05-07T20:32:37.1092362Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:37.1092464Z moe/activation_test.py:115: in fn 2025-05-07T20:32:37.1092564Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:37.1093057Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 
2025-05-07T20:32:37.1093154Z _fbgemm_silu_mul_quant[grid]( 2025-05-07T20:32:37.1093509Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:37.1093784Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:37.1094121Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:37.1094784Z kernel = self.compile( 2025-05-07T20:32:37.1095165Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:37.1095340Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:37.1095474Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:37.1095478Z 2025-05-07T20:32:37.1095681Z self = 2025-05-07T20:32:37.1096491Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:37.1096987Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f78bf1496c0>} 2025-05-07T20:32:37.1097720Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:37.1097955Z context = 2025-05-07T20:32:37.1097963Z 2025-05-07T20:32:37.1101453Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:37.1101744Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:37.1101862Z module_map=module_map) 2025-05-07T20:32:37.1102030Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:37.1102131Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:32:37.1102209Z E ^ 2025-05-07T20:32:37.1102563Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:37.1102572Z 2025-05-07T20:32:37.1102982Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:37.1102987Z 2025-05-07T20:32:37.1103091Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:37.1103421Z self=, 2025-05-07T20:32:37.1103503Z T=4096, 2025-05-07T20:32:37.1103584Z D=5120, 2025-05-07T20:32:37.1103666Z scale_ub=1200.0, 2025-05-07T20:32:37.1103754Z contiguous=False, 2025-05-07T20:32:37.1103839Z compiled=True, 2025-05-07T20:32:37.1103910Z ) 2025-05-07T20:32:37.1104126Z self = 2025-05-07T20:32:37.1104299Z T = 4096, D = 5120, scale_ub = 1200.0, contiguous = False, compiled = True 2025-05-07T20:32:37.1104304Z 2025-05-07T20:32:37.1104380Z @given( 2025-05-07T20:32:37.1104502Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:37.1104604Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:37.1104718Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:37.1104837Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:37.1104953Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:37.1105026Z ) 2025-05-07T20:32:37.1105274Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:37.1105364Z def test_silu_mul_quant( 2025-05-07T20:32:37.1105441Z self, 2025-05-07T20:32:37.1105521Z T: int, 2025-05-07T20:32:37.1105597Z D: int, 2025-05-07T20:32:37.1105692Z scale_ub: Optional[float], 2025-05-07T20:32:37.1105782Z contiguous: bool, 2025-05-07T20:32:37.1105867Z compiled: bool, 2025-05-07T20:32:37.1105948Z ) -> None: 2025-05-07T20:32:37.1106041Z torch.manual_seed(2025) 2025-05-07T20:32:37.1106112Z 2025-05-07T20:32:37.1106279Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:37.1106417Z 2025-05-07T20:32:37.1106507Z x_sign = torch.sign(x) 2025-05-07T20:32:37.1106634Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:37.1106722Z x = x_sign * x_clamp 2025-05-07T20:32:37.1106804Z x0 = x[:, :D] 2025-05-07T20:32:37.1106892Z x1 = x[:, D:] 2025-05-07T20:32:37.1106963Z 2025-05-07T20:32:37.1107047Z if contiguous: 2025-05-07T20:32:37.1107142Z x0 = x0.contiguous() 2025-05-07T20:32:37.1107295Z x1 = x1.contiguous() 2025-05-07T20:32:37.1107368Z 2025-05-07T20:32:37.1107457Z if scale_ub is not None: 2025-05-07T20:32:37.1107560Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:37.1107696Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:37.1107773Z ) 2025-05-07T20:32:37.1107847Z else: 2025-05-07T20:32:37.1107943Z scale_ub_tensor = None 2025-05-07T20:32:37.1108016Z 2025-05-07T20:32:37.1108146Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:37.1108237Z op = silu_mul_quant 2025-05-07T20:32:37.1108320Z if compiled: 2025-05-07T20:32:37.1108418Z op = torch.compile(op) 2025-05-07T20:32:37.1108593Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:37.1108667Z 2025-05-07T20:32:37.1108757Z > y_fp8, y_scale = fn() 2025-05-07T20:32:37.1108765Z 2025-05-07T20:32:37.1108864Z moe/activation_test.py:117: 2025-05-07T20:32:37.1108994Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:37.1109096Z moe/activation_test.py:115: in fn 2025-05-07T20:32:37.1109195Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:37.1109556Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/torch/_dynamo/eval_frame.py:678: in _fn 2025-05-07T20:32:37.1109652Z return fn(*args, **kwargs) 
2025-05-07T20:32:37.1110136Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:32:37.1110238Z _fbgemm_silu_mul_quant[grid]( 2025-05-07T20:32:37.1110593Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:37.1110881Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:37.1111225Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:37.1111319Z kernel = self.compile( 2025-05-07T20:32:37.1111695Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:37.1111867Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:37.1111993Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:37.1112003Z 2025-05-07T20:32:37.1112206Z self = 2025-05-07T20:32:37.1112978Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:37.1113482Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f78bf14afc0>} 2025-05-07T20:32:37.1114217Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:37.1114407Z context = 2025-05-07T20:32:37.1114412Z 2025-05-07T20:32:37.1114579Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:37.1114880Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:37.1114986Z module_map=module_map) 2025-05-07T20:32:37.1115159Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:37.1115257Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:32:37.1115338Z E ^ 2025-05-07T20:32:37.1115685Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:37.1115728Z 2025-05-07T20:32:37.1116137Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:37.1116141Z 2025-05-07T20:32:37.1116244Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:37.1116462Z self=, 2025-05-07T20:32:37.1116540Z T=2048, 2025-05-07T20:32:37.1116619Z D=7168, 2025-05-07T20:32:37.1116703Z scale_ub=1200.0, 2025-05-07T20:32:37.1116792Z contiguous=False, 2025-05-07T20:32:37.1116874Z compiled=False, 2025-05-07T20:32:37.1116947Z ) 2025-05-07T20:32:37.1117162Z self = 2025-05-07T20:32:37.1117374Z T = 2048, D = 7168, scale_ub = 1200.0, contiguous = False, compiled = False 2025-05-07T20:32:37.1117378Z 2025-05-07T20:32:37.1117459Z @given( 2025-05-07T20:32:37.1117577Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:37.1117678Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:37.1117796Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:37.1117910Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:37.1118025Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:37.1118098Z ) 2025-05-07T20:32:37.1118339Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:37.1118439Z def test_silu_mul_quant( 2025-05-07T20:32:37.1118514Z self, 2025-05-07T20:32:37.1118590Z T: int, 2025-05-07T20:32:37.1118667Z D: int, 2025-05-07T20:32:37.1118765Z scale_ub: Optional[float], 2025-05-07T20:32:37.1118852Z contiguous: bool, 2025-05-07T20:32:37.1118941Z compiled: bool, 2025-05-07T20:32:37.1119061Z ) -> None: 2025-05-07T20:32:37.1119155Z torch.manual_seed(2025) 2025-05-07T20:32:37.1119231Z 2025-05-07T20:32:37.1119396Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:37.1119468Z 2025-05-07T20:32:37.1119561Z x_sign = torch.sign(x) 2025-05-07T20:32:37.1119684Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:37.1119773Z x = x_sign * x_clamp 2025-05-07T20:32:37.1119851Z x0 = x[:, :D] 2025-05-07T20:32:37.1119931Z x1 = x[:, D:] 2025-05-07T20:32:37.1120003Z 2025-05-07T20:32:37.1120084Z if contiguous: 2025-05-07T20:32:37.1120178Z x0 = x0.contiguous() 2025-05-07T20:32:37.1120267Z x1 = x1.contiguous() 2025-05-07T20:32:37.1120338Z 2025-05-07T20:32:37.1120427Z if scale_ub is not None: 2025-05-07T20:32:37.1120531Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:37.1120666Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:37.1120743Z ) 2025-05-07T20:32:37.1120821Z else: 2025-05-07T20:32:37.1120913Z scale_ub_tensor = None 2025-05-07T20:32:37.1120990Z 2025-05-07T20:32:37.1121116Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:37.1121204Z op = silu_mul_quant 2025-05-07T20:32:37.1121289Z if compiled: 2025-05-07T20:32:37.1121388Z op = torch.compile(op) 2025-05-07T20:32:37.1121493Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:37.1121568Z 2025-05-07T20:32:37.1121657Z > y_fp8, y_scale = fn() 2025-05-07T20:32:37.1121706Z 2025-05-07T20:32:37.1121803Z moe/activation_test.py:117: 2025-05-07T20:32:37.1121935Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:37.1122034Z moe/activation_test.py:115: in fn 2025-05-07T20:32:37.1122138Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:37.1122629Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 
2025-05-07T20:32:37.1122726Z _fbgemm_silu_mul_quant[grid]( 2025-05-07T20:32:37.1123121Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:37.1123340Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:37.1123673Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:37.1123770Z kernel = self.compile( 2025-05-07T20:32:37.1124150Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:37.1124322Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:37.1124487Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:37.1124492Z 2025-05-07T20:32:37.1124694Z self = 2025-05-07T20:32:37.1125457Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:37.1125955Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f78bf14bec0>} 2025-05-07T20:32:37.1126688Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:37.1126879Z context = 2025-05-07T20:32:37.1126883Z 2025-05-07T20:32:37.1127049Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:37.1127343Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:37.1127451Z module_map=module_map) 2025-05-07T20:32:37.1127616Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:37.1127711Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:32:37.1127791Z E ^ 2025-05-07T20:32:37.1128142Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:37.1128146Z 2025-05-07T20:32:37.1128547Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:37.1128555Z 2025-05-07T20:32:37.1128661Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:37.1128880Z self=, 2025-05-07T20:32:37.1128962Z T=1, 2025-05-07T20:32:37.1129043Z D=7168, 2025-05-07T20:32:37.1129127Z scale_ub=None, 2025-05-07T20:32:37.1129213Z contiguous=True, 2025-05-07T20:32:37.1129302Z compiled=False, 2025-05-07T20:32:37.1129379Z ) 2025-05-07T20:32:37.1129592Z self = 2025-05-07T20:32:37.1129753Z T = 1, D = 7168, scale_ub = None, contiguous = True, compiled = False 2025-05-07T20:32:37.1129757Z 2025-05-07T20:32:37.1129831Z @given( 2025-05-07T20:32:37.1129952Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:37.1130048Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:37.1130159Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:37.1130325Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:37.1130436Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:37.1130508Z ) 2025-05-07T20:32:37.1130757Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:37.1130849Z def test_silu_mul_quant( 2025-05-07T20:32:37.1130924Z self, 2025-05-07T20:32:37.1131004Z T: int, 2025-05-07T20:32:37.1131080Z D: int, 2025-05-07T20:32:37.1131222Z scale_ub: Optional[float], 2025-05-07T20:32:37.1131311Z contiguous: bool, 2025-05-07T20:32:37.1131395Z compiled: bool, 2025-05-07T20:32:37.1131476Z ) -> None: 2025-05-07T20:32:37.1131568Z torch.manual_seed(2025) 2025-05-07T20:32:37.1131639Z 2025-05-07T20:32:37.1131805Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:37.1131876Z 2025-05-07T20:32:37.1131968Z x_sign = torch.sign(x) 2025-05-07T20:32:37.1132099Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:37.1132184Z x = x_sign * x_clamp 2025-05-07T20:32:37.1132264Z x0 = x[:, :D] 2025-05-07T20:32:37.1132344Z x1 = x[:, D:] 2025-05-07T20:32:37.1132414Z 2025-05-07T20:32:37.1132536Z if contiguous: 2025-05-07T20:32:37.1132632Z x0 = x0.contiguous() 2025-05-07T20:32:37.1132721Z x1 = x1.contiguous() 2025-05-07T20:32:37.1132798Z 2025-05-07T20:32:37.1132886Z if scale_ub is not None: 2025-05-07T20:32:37.1132993Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:37.1133127Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:37.1133202Z ) 2025-05-07T20:32:37.1133277Z else: 2025-05-07T20:32:37.1133373Z scale_ub_tensor = None 2025-05-07T20:32:37.1133445Z 2025-05-07T20:32:37.1133572Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:37.1133752Z op = silu_mul_quant 2025-05-07T20:32:37.1133836Z if compiled: 2025-05-07T20:32:37.1133932Z op = torch.compile(op) 2025-05-07T20:32:37.1134040Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:37.1134113Z 2025-05-07T20:32:37.1134211Z > y_fp8, y_scale = fn() 2025-05-07T20:32:37.1134216Z 2025-05-07T20:32:37.1134355Z moe/activation_test.py:117: 2025-05-07T20:32:37.1134485Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:37.1134590Z moe/activation_test.py:115: in fn 2025-05-07T20:32:37.1134688Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:37.1135177Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:32:37.1135278Z 
_fbgemm_silu_mul_quant[grid]( 2025-05-07T20:32:37.1135653Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:37.1135905Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:37.1136240Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:37.1136330Z kernel = self.compile( 2025-05-07T20:32:37.1136716Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:37.1136887Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:37.1137018Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:37.1137028Z 2025-05-07T20:32:37.1137230Z self = 2025-05-07T20:32:37.1137989Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:37.1138527Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f779d5bccc0>} 2025-05-07T20:32:37.1139262Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:37.1139456Z context = 2025-05-07T20:32:37.1139499Z 2025-05-07T20:32:37.1139664Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:37.1139921Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:37.1140035Z module_map=module_map) 2025-05-07T20:32:37.1140194Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:37.1140298Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:32:37.1140375Z E ^ 2025-05-07T20:32:37.1140720Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:37.1140724Z 2025-05-07T20:32:37.1141198Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:37.1141203Z 2025-05-07T20:32:37.1141305Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:37.1141526Z self=, 2025-05-07T20:32:37.1141606Z T=16384, 2025-05-07T20:32:37.1141682Z D=7168, 2025-05-07T20:32:37.1141772Z scale_ub=1200.0, 2025-05-07T20:32:37.1141857Z contiguous=False, 2025-05-07T20:32:37.1141942Z compiled=True, 2025-05-07T20:32:37.1142018Z ) 2025-05-07T20:32:37.1142233Z self = 2025-05-07T20:32:37.1142408Z T = 16384, D = 7168, scale_ub = 1200.0, contiguous = False, compiled = True 2025-05-07T20:32:37.1142413Z 2025-05-07T20:32:37.1142490Z @given( 2025-05-07T20:32:37.1142609Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:37.1142709Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:37.1142827Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:37.1142985Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:37.1143102Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:37.1143178Z ) 2025-05-07T20:32:37.1143418Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:37.1143512Z def test_silu_mul_quant( 2025-05-07T20:32:37.1143586Z self, 2025-05-07T20:32:37.1143661Z T: int, 2025-05-07T20:32:37.1143739Z D: int, 2025-05-07T20:32:37.1143837Z scale_ub: Optional[float], 2025-05-07T20:32:37.1143924Z contiguous: bool, 2025-05-07T20:32:37.1144014Z compiled: bool, 2025-05-07T20:32:37.1144090Z ) -> None: 2025-05-07T20:32:37.1144182Z torch.manual_seed(2025) 2025-05-07T20:32:37.1144260Z 2025-05-07T20:32:37.1144424Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:37.1144498Z 2025-05-07T20:32:37.1144590Z x_sign = torch.sign(x) 2025-05-07T20:32:37.1144717Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:37.1144810Z x = x_sign * x_clamp 2025-05-07T20:32:37.1144889Z x0 = x[:, :D] 2025-05-07T20:32:37.1144968Z x1 = x[:, D:] 2025-05-07T20:32:37.1145040Z 2025-05-07T20:32:37.1145121Z if contiguous: 2025-05-07T20:32:37.1145210Z x0 = x0.contiguous() 2025-05-07T20:32:37.1145302Z x1 = x1.contiguous() 2025-05-07T20:32:37.1145372Z 2025-05-07T20:32:37.1145478Z if scale_ub is not None: 2025-05-07T20:32:37.1145594Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:37.1145793Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:37.1145870Z ) 2025-05-07T20:32:37.1145945Z else: 2025-05-07T20:32:37.1146037Z scale_ub_tensor = None 2025-05-07T20:32:37.1146111Z 2025-05-07T20:32:37.1146240Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:37.1146327Z op = silu_mul_quant 2025-05-07T20:32:37.1146415Z if compiled: 2025-05-07T20:32:37.1146513Z op = torch.compile(op) 2025-05-07T20:32:37.1146617Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:37.1146735Z 2025-05-07T20:32:37.1146823Z > y_fp8, y_scale = fn() 2025-05-07T20:32:37.1146828Z 2025-05-07T20:32:37.1146921Z moe/activation_test.py:117: 2025-05-07T20:32:37.1147053Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:37.1147152Z moe/activation_test.py:115: in fn 2025-05-07T20:32:37.1147253Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:37.1147614Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/torch/_dynamo/eval_frame.py:678: in _fn 2025-05-07T20:32:37.1147706Z return fn(*args, **kwargs) 
2025-05-07T20:32:37.1148234Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:32:37.1148334Z _fbgemm_silu_mul_quant[grid]( 2025-05-07T20:32:37.1148685Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:37.1148911Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:37.1149243Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:37.1149339Z kernel = self.compile( 2025-05-07T20:32:37.1149715Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:37.1149887Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:37.1150015Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:37.1150019Z 2025-05-07T20:32:37.1150222Z self = 2025-05-07T20:32:37.1151028Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:37.1151525Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f779d5be0c0>} 2025-05-07T20:32:37.1152256Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:37.1152447Z context = 2025-05-07T20:32:37.1152452Z 2025-05-07T20:32:37.1152613Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:37.1152873Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:37.1152983Z module_map=module_map) 2025-05-07T20:32:37.1153143Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:37.1153244Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:32:37.1153320Z E ^ 2025-05-07T20:32:37.1153668Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')")

/home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:100: CompilationError

Trying example: test_silu_mul_quant(
    self=,
    T=1,
    D=7168,
    scale_ub=None,
    contiguous=False,
    compiled=False,
)
self =
T = 1, D = 7168, scale_ub = None, contiguous = False, compiled = False

    @given(
        T=st.sampled_from([1, 128, 2048, 4096, 16384]),
        D=st.sampled_from([5120, 7168]),
        scale_ub=st.sampled_from([None, 1200.00]),
        contiguous=st.sampled_from([True, False]),
        compiled=st.sampled_from([True, False]),
    )
    @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None)
    def test_silu_mul_quant(
        self,
        T: int,
        D: int,
        scale_ub: Optional[float],
        contiguous: bool,
        compiled: bool,
    ) -> None:
        torch.manual_seed(2025)

        x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16)

        x_sign = torch.sign(x)
        x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0)
        x = x_sign * x_clamp
        x0 = x[:, :D]
        x1 = x[:, D:]

        if contiguous:
            x0 = x0.contiguous()
            x1 = x1.contiguous()

        if scale_ub is not None:
            scale_ub_tensor = torch.tensor(
                [scale_ub], device="cuda", dtype=torch.float32
            )
        else:
            scale_ub_tensor = None

        def fn() -> Tuple[torch.Tensor, torch.Tensor]:
            op = silu_mul_quant
            if compiled:
                op = torch.compile(op)
            return op(x0, x1, scale_ub_tensor)

>       y_fp8, y_scale = fn()

moe/activation_test.py:117:
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _
moe/activation_test.py:115: in fn
    return op(x0, x1, scale_ub_tensor)
/home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant
    _fbgemm_silu_mul_quant[grid](
/home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/jit.py:330: in <lambda>
    return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs)
/home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/jit.py:623: in run
    kernel = self.compile(
/home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:273: in compile
    module = src.make_ir(options, codegen_fns, module_map, context)
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _

self =
options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True)
codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f779d5bec00>}
module_map = {'triton.language.extra.libdevice': }
context =

    def make_ir(self, options, codegen_fns, module_map, context):
>       return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns,
                           module_map=module_map)
E   triton.compiler.errors.CompilationError: at 1:0:
E   def _fbgemm_silu_mul_quant(
E   ^
E   ValueError("type fp8e4nv not supported in this architecture.
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')")

/home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:100: CompilationError
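Every example fails the same way, and it fails before the kernel ever runs: Triton rejects the fp8e4nv dtype (FP8 E4M3) while lowering _fbgemm_silu_mul_quant. That dtype is only supported on NVIDIA GPUs with compute capability 8.9 or newer (Ada/Hopper class), which is consistent with this runner carrying a pre-8.9 part such as the A10G found in g5 instances (capability 8.6); on older architectures Triton offers only 'fp8e4b15' and 'fp8e5', exactly as the error says. Rather than letting Hypothesis burn through _MAX_SAMPLES identical failures, the test class could be gated on device capability. A minimal sketch, assuming a unittest-style suite like the one in the trace (the helper _supports_fp8e4nv and the class name are hypothetical, not FBGEMM API):

    import unittest

    import torch

    def _supports_fp8e4nv() -> bool:
        # Triton lowers fp8e4nv (FP8 E4M3) only on NVIDIA parts with
        # compute capability >= 8.9; the A10G in g5 instances reports
        # (8, 6), which is why the compile step above rejects the dtype.
        if not torch.cuda.is_available():
            return False
        return torch.cuda.get_device_capability() >= (8, 9)

    @unittest.skipIf(not _supports_fp8e4nv(), "fp8e4nv not supported on this GPU")
    class SiluMulQuantTests(unittest.TestCase):
        ...

With such a gate, the whole class would report as skipped on this machine instead of producing one CompilationError per drawn example.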
Hypothesis then tried eleven more examples. Every one failed at the same kernel-compile step (triton/compiler/compiler.py:100) with the identical CompilationError: ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')").

Trying example: test_silu_mul_quant(T=2048,  D=7168, scale_ub=None,   contiguous=False, compiled=True)
Trying example: test_silu_mul_quant(T=4096,  D=7168, scale_ub=None,   contiguous=False, compiled=True)
Trying example: test_silu_mul_quant(T=16384, D=5120, scale_ub=1200.0, contiguous=False, compiled=False)
Trying example: test_silu_mul_quant(T=16384, D=5120, scale_ub=1200.0, contiguous=True,  compiled=True)
Trying example: test_silu_mul_quant(T=16384, D=5120, scale_ub=None,   contiguous=False, compiled=True)
Trying example: test_silu_mul_quant(T=2048,  D=5120, scale_ub=None,   contiguous=False, compiled=True)
Trying example: test_silu_mul_quant(T=2048,  D=5120, scale_ub=1200.0, contiguous=False, compiled=True)
Trying example: test_silu_mul_quant(T=4096,  D=5120, scale_ub=1200.0, contiguous=True,  compiled=True)
Trying example: test_silu_mul_quant(T=128,   D=5120, scale_ub=1200.0, contiguous=False, compiled=True)
Trying example: test_silu_mul_quant(T=16384, D=7168, scale_ub=1200.0, contiguous=True,  compiled=True)
Trying example: test_silu_mul_quant(T=16384, D=5120, scale_ub=1200.0, contiguous=True,  compiled=False)
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:37.1311875Z 2025-05-07T20:32:37.1312283Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:37.1312287Z 2025-05-07T20:32:37.1312388Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:37.1312604Z self=, 2025-05-07T20:32:37.1312684Z T=1, 2025-05-07T20:32:37.1312758Z D=7168, 2025-05-07T20:32:37.1312843Z scale_ub=1200.0, 2025-05-07T20:32:37.1312927Z contiguous=False, 2025-05-07T20:32:37.1313011Z compiled=False, 2025-05-07T20:32:37.1313085Z ) 2025-05-07T20:32:37.1313301Z self = 2025-05-07T20:32:37.1313505Z T = 1, D = 7168, scale_ub = 1200.0, contiguous = False, compiled = False 2025-05-07T20:32:37.1313510Z 2025-05-07T20:32:37.1313587Z @given( 2025-05-07T20:32:37.1313705Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:37.1313806Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:37.1313927Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:37.1314041Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:37.1314155Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:37.1314227Z ) 2025-05-07T20:32:37.1314466Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:37.1314564Z def test_silu_mul_quant( 2025-05-07T20:32:37.1314640Z self, 2025-05-07T20:32:37.1314714Z T: int, 2025-05-07T20:32:37.1314792Z D: int, 2025-05-07T20:32:37.1314888Z scale_ub: Optional[float], 2025-05-07T20:32:37.1314978Z contiguous: bool, 2025-05-07T20:32:37.1315067Z compiled: bool, 2025-05-07T20:32:37.1315144Z ) -> None: 2025-05-07T20:32:37.1315237Z torch.manual_seed(2025) 2025-05-07T20:32:37.1315310Z 2025-05-07T20:32:37.1315476Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:37.1315549Z 2025-05-07T20:32:37.1315638Z x_sign = torch.sign(x) 2025-05-07T20:32:37.1315762Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:37.1315851Z x = x_sign * x_clamp 2025-05-07T20:32:37.1315929Z x0 = x[:, :D] 2025-05-07T20:32:37.1316006Z x1 = x[:, D:] 2025-05-07T20:32:37.1316079Z 2025-05-07T20:32:37.1316207Z if contiguous: 2025-05-07T20:32:37.1316297Z x0 = x0.contiguous() 2025-05-07T20:32:37.1316388Z x1 = x1.contiguous() 2025-05-07T20:32:37.1316458Z 2025-05-07T20:32:37.1316547Z if scale_ub is not None: 2025-05-07T20:32:37.1316656Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:37.1316792Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:37.1316868Z ) 2025-05-07T20:32:37.1316942Z else: 2025-05-07T20:32:37.1317034Z scale_ub_tensor = None 2025-05-07T20:32:37.1317149Z 2025-05-07T20:32:37.1317277Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:37.1317365Z op = silu_mul_quant 2025-05-07T20:32:37.1317453Z if compiled: 2025-05-07T20:32:37.1317550Z op = torch.compile(op) 2025-05-07T20:32:37.1317654Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:37.1317729Z 2025-05-07T20:32:37.1317820Z > y_fp8, y_scale = fn() 2025-05-07T20:32:37.1317827Z 2025-05-07T20:32:37.1317921Z moe/activation_test.py:117: 2025-05-07T20:32:37.1318051Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:37.1318149Z moe/activation_test.py:115: in fn 2025-05-07T20:32:37.1318290Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:37.1318780Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:32:37.1318878Z 
_fbgemm_silu_mul_quant[grid]( 2025-05-07T20:32:37.1319237Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:37.1319457Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:37.1319790Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:37.1319884Z kernel = self.compile( 2025-05-07T20:32:37.1320263Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:37.1320435Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:37.1320563Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:37.1320568Z 2025-05-07T20:32:37.1320808Z self = 2025-05-07T20:32:37.1321570Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:37.1322066Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f779d20c0e0>} 2025-05-07T20:32:37.1322802Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:37.1322994Z context = 2025-05-07T20:32:37.1323001Z 2025-05-07T20:32:37.1323170Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:37.1323423Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:37.1323531Z module_map=module_map) 2025-05-07T20:32:37.1323693Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:37.1323790Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:32:37.1323864Z E ^ 2025-05-07T20:32:37.1324213Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:37.1324217Z 2025-05-07T20:32:37.1324688Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:37.1324692Z 2025-05-07T20:32:37.1324798Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:37.1325017Z self=, 2025-05-07T20:32:37.1325094Z T=4096, 2025-05-07T20:32:37.1325176Z D=7168, 2025-05-07T20:32:37.1325259Z scale_ub=1200.0, 2025-05-07T20:32:37.1325356Z contiguous=False, 2025-05-07T20:32:37.1325456Z compiled=True, 2025-05-07T20:32:37.1325592Z ) 2025-05-07T20:32:37.1325812Z self = 2025-05-07T20:32:37.1325987Z T = 4096, D = 7168, scale_ub = 1200.0, contiguous = False, compiled = True 2025-05-07T20:32:37.1325991Z 2025-05-07T20:32:37.1326067Z @given( 2025-05-07T20:32:37.1326187Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:37.1326286Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:37.1326405Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:37.1326523Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:37.1326636Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:37.1326710Z ) 2025-05-07T20:32:37.1326992Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:37.1327089Z def test_silu_mul_quant( 2025-05-07T20:32:37.1327169Z self, 2025-05-07T20:32:37.1327246Z T: int, 2025-05-07T20:32:37.1327326Z D: int, 2025-05-07T20:32:37.1327425Z scale_ub: Optional[float], 2025-05-07T20:32:37.1327514Z contiguous: bool, 2025-05-07T20:32:37.1327598Z compiled: bool, 2025-05-07T20:32:37.1327678Z ) -> None: 2025-05-07T20:32:37.1327771Z torch.manual_seed(2025) 2025-05-07T20:32:37.1327844Z 2025-05-07T20:32:37.1328012Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:37.1328084Z 2025-05-07T20:32:37.1328176Z x_sign = torch.sign(x) 2025-05-07T20:32:37.1328302Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:37.1328389Z x = x_sign * x_clamp 2025-05-07T20:32:37.1328468Z x0 = x[:, :D] 2025-05-07T20:32:37.1328547Z x1 = x[:, D:] 2025-05-07T20:32:37.1328621Z 2025-05-07T20:32:37.1328708Z if contiguous: 2025-05-07T20:32:37.1328841Z x0 = x0.contiguous() 2025-05-07T20:32:37.1328930Z x1 = x1.contiguous() 2025-05-07T20:32:37.1329003Z 2025-05-07T20:32:37.1329096Z if scale_ub is not None: 2025-05-07T20:32:37.1329200Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:37.1329335Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:37.1329409Z ) 2025-05-07T20:32:37.1329482Z else: 2025-05-07T20:32:37.1329576Z scale_ub_tensor = None 2025-05-07T20:32:37.1329646Z 2025-05-07T20:32:37.1329772Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:37.1329866Z op = silu_mul_quant 2025-05-07T20:32:37.1329947Z if compiled: 2025-05-07T20:32:37.1330050Z op = torch.compile(op) 2025-05-07T20:32:37.1330154Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:37.1330225Z 2025-05-07T20:32:37.1330320Z > y_fp8, y_scale = fn() 2025-05-07T20:32:37.1330327Z 2025-05-07T20:32:37.1330421Z moe/activation_test.py:117: 2025-05-07T20:32:37.1330549Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:37.1330653Z moe/activation_test.py:115: in fn 2025-05-07T20:32:37.1330750Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:37.1331108Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/torch/_dynamo/eval_frame.py:678: in _fn 2025-05-07T20:32:37.1331203Z return fn(*args, **kwargs) 
2025-05-07T20:32:37.1331687Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:32:37.1331829Z _fbgemm_silu_mul_quant[grid]( 2025-05-07T20:32:37.1332181Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:37.1332402Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:37.1332744Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:37.1332837Z kernel = self.compile( 2025-05-07T20:32:37.1333258Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:37.1333429Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:37.1333553Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:37.1333558Z 2025-05-07T20:32:37.1333841Z self = 2025-05-07T20:32:37.1334604Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:37.1335144Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f779d20d300>} 2025-05-07T20:32:37.1335929Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:37.1336117Z context = 2025-05-07T20:32:37.1336122Z 2025-05-07T20:32:37.1336285Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:37.1336539Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:37.1336652Z module_map=module_map) 2025-05-07T20:32:37.1336810Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:37.1336907Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:32:37.1336987Z E ^ 2025-05-07T20:32:37.1337372Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:37.1337377Z 2025-05-07T20:32:37.1337780Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:37.1337790Z 2025-05-07T20:32:37.1337891Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:37.1338108Z self=, 2025-05-07T20:32:37.1338185Z T=128, 2025-05-07T20:32:37.1338261Z D=7168, 2025-05-07T20:32:37.1338343Z scale_ub=1200.0, 2025-05-07T20:32:37.1338432Z contiguous=False, 2025-05-07T20:32:37.1338513Z compiled=True, 2025-05-07T20:32:37.1338585Z ) 2025-05-07T20:32:37.1338801Z self = 2025-05-07T20:32:37.1338971Z T = 128, D = 7168, scale_ub = 1200.0, contiguous = False, compiled = True 2025-05-07T20:32:37.1338975Z 2025-05-07T20:32:37.1339051Z @given( 2025-05-07T20:32:37.1339170Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:37.1339267Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:37.1339385Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:37.1339500Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:37.1339611Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:37.1339688Z ) 2025-05-07T20:32:37.1339931Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:37.1342856Z def test_silu_mul_quant( 2025-05-07T20:32:37.1342944Z self, 2025-05-07T20:32:37.1343097Z T: int, 2025-05-07T20:32:37.1343178Z D: int, 2025-05-07T20:32:37.1343278Z scale_ub: Optional[float], 2025-05-07T20:32:37.1343367Z contiguous: bool, 2025-05-07T20:32:37.1343458Z compiled: bool, 2025-05-07T20:32:37.1343538Z ) -> None: 2025-05-07T20:32:37.1343635Z torch.manual_seed(2025) 2025-05-07T20:32:37.1343713Z 2025-05-07T20:32:37.1343884Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:37.1343958Z 2025-05-07T20:32:37.1344092Z x_sign = torch.sign(x) 2025-05-07T20:32:37.1344215Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:37.1344304Z x = x_sign * x_clamp 2025-05-07T20:32:37.1344382Z x0 = x[:, :D] 2025-05-07T20:32:37.1344460Z x1 = x[:, D:] 2025-05-07T20:32:37.1344535Z 2025-05-07T20:32:37.1344619Z if contiguous: 2025-05-07T20:32:37.1344708Z x0 = x0.contiguous() 2025-05-07T20:32:37.1344802Z x1 = x1.contiguous() 2025-05-07T20:32:37.1344874Z 2025-05-07T20:32:37.1344962Z if scale_ub is not None: 2025-05-07T20:32:37.1345071Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:37.1345203Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:37.1345323Z ) 2025-05-07T20:32:37.1345399Z else: 2025-05-07T20:32:37.1345493Z scale_ub_tensor = None 2025-05-07T20:32:37.1345568Z 2025-05-07T20:32:37.1345695Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:37.1345787Z op = silu_mul_quant 2025-05-07T20:32:37.1345873Z if compiled: 2025-05-07T20:32:37.1345971Z op = torch.compile(op) 2025-05-07T20:32:37.1346074Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:37.1346149Z 2025-05-07T20:32:37.1346237Z > y_fp8, y_scale = fn() 2025-05-07T20:32:37.1346242Z 2025-05-07T20:32:37.1346338Z moe/activation_test.py:117: 2025-05-07T20:32:37.1346474Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:37.1346573Z moe/activation_test.py:115: in fn 2025-05-07T20:32:37.1346675Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:37.1347045Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/torch/_dynamo/eval_frame.py:678: in _fn 2025-05-07T20:32:37.1347177Z return fn(*args, **kwargs) 
2025-05-07T20:32:37.1347667Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:32:37.1347764Z _fbgemm_silu_mul_quant[grid]( 2025-05-07T20:32:37.1348118Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:37.1348340Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:37.1348674Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:37.1348770Z kernel = self.compile( 2025-05-07T20:32:37.1349149Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:37.1349323Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:37.1349455Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:37.1349459Z 2025-05-07T20:32:37.1349661Z self = 2025-05-07T20:32:37.1350428Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:37.1350922Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f779d20e160>} 2025-05-07T20:32:37.1351698Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:37.1351889Z context = 2025-05-07T20:32:37.1351893Z 2025-05-07T20:32:37.1352057Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:37.1352318Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:37.1352466Z module_map=module_map) 2025-05-07T20:32:37.1352627Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:37.1352725Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:32:37.1352801Z E ^ 2025-05-07T20:32:37.1353155Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:37.1353162Z 2025-05-07T20:32:37.1353566Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:37.1353570Z 2025-05-07T20:32:37.1353671Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:37.1353930Z self=, 2025-05-07T20:32:37.1354009Z T=2048, 2025-05-07T20:32:37.1354084Z D=7168, 2025-05-07T20:32:37.1354170Z scale_ub=None, 2025-05-07T20:32:37.1354256Z contiguous=True, 2025-05-07T20:32:37.1354341Z compiled=True, 2025-05-07T20:32:37.1354414Z ) 2025-05-07T20:32:37.1354629Z self = 2025-05-07T20:32:37.1354797Z T = 2048, D = 7168, scale_ub = None, contiguous = True, compiled = True 2025-05-07T20:32:37.1354801Z 2025-05-07T20:32:37.1354874Z @given( 2025-05-07T20:32:37.1354991Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:37.1355093Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:37.1355205Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:37.1355320Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:37.1355434Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:37.1355508Z ) 2025-05-07T20:32:37.1355819Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:37.1355911Z def test_silu_mul_quant( 2025-05-07T20:32:37.1355987Z self, 2025-05-07T20:32:37.1356072Z T: int, 2025-05-07T20:32:37.1356149Z D: int, 2025-05-07T20:32:37.1356246Z scale_ub: Optional[float], 2025-05-07T20:32:37.1356338Z contiguous: bool, 2025-05-07T20:32:37.1356421Z compiled: bool, 2025-05-07T20:32:37.1356502Z ) -> None: 2025-05-07T20:32:37.1356597Z torch.manual_seed(2025) 2025-05-07T20:32:37.1356667Z 2025-05-07T20:32:37.1356834Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:37.1356910Z 2025-05-07T20:32:37.1357000Z x_sign = torch.sign(x) 2025-05-07T20:32:37.1357126Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:37.1357212Z x = x_sign * x_clamp 2025-05-07T20:32:37.1357292Z x0 = x[:, :D] 2025-05-07T20:32:37.1357376Z x1 = x[:, D:] 2025-05-07T20:32:37.1357448Z 2025-05-07T20:32:37.1357535Z if contiguous: 2025-05-07T20:32:37.1357624Z x0 = x0.contiguous() 2025-05-07T20:32:37.1357712Z x1 = x1.contiguous() 2025-05-07T20:32:37.1357793Z 2025-05-07T20:32:37.1357881Z if scale_ub is not None: 2025-05-07T20:32:37.1357986Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:37.1358121Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:37.1358195Z ) 2025-05-07T20:32:37.1358271Z else: 2025-05-07T20:32:37.1358367Z scale_ub_tensor = None 2025-05-07T20:32:37.1358438Z 2025-05-07T20:32:37.1358612Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:37.1358703Z op = silu_mul_quant 2025-05-07T20:32:37.1358787Z if compiled: 2025-05-07T20:32:37.1358886Z op = torch.compile(op) 2025-05-07T20:32:37.1358991Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:37.1359062Z 2025-05-07T20:32:37.1359158Z > y_fp8, y_scale = fn() 2025-05-07T20:32:37.1359162Z 2025-05-07T20:32:37.1359258Z moe/activation_test.py:117: 2025-05-07T20:32:37.1359430Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:37.1359531Z moe/activation_test.py:115: in fn 2025-05-07T20:32:37.1359628Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:37.1359989Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/torch/_dynamo/eval_frame.py:678: in _fn 2025-05-07T20:32:37.1360083Z return fn(*args, **kwargs) 
2025-05-07T20:32:37.1360569Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:32:37.1360669Z _fbgemm_silu_mul_quant[grid]( 2025-05-07T20:32:37.1361022Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:37.1361282Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:37.1361621Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:37.1361716Z kernel = self.compile( 2025-05-07T20:32:37.1362094Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:37.1362264Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:37.1362391Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:37.1362396Z 2025-05-07T20:32:37.1362603Z self = 2025-05-07T20:32:37.1363366Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:37.1363903Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f779d20f420>} 2025-05-07T20:32:37.1364635Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:37.1364822Z context = 2025-05-07T20:32:37.1364827Z 2025-05-07T20:32:37.1364991Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:37.1365248Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:37.1365358Z module_map=module_map) 2025-05-07T20:32:37.1365516Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:37.1365616Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:32:37.1365697Z E ^ 2025-05-07T20:32:37.1366042Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:37.1366049Z 2025-05-07T20:32:37.1366455Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:37.1366459Z 2025-05-07T20:32:37.1366560Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:37.1366778Z self=, 2025-05-07T20:32:37.1366859Z T=16384, 2025-05-07T20:32:37.1366933Z D=5120, 2025-05-07T20:32:37.1367054Z scale_ub=None, 2025-05-07T20:32:37.1367141Z contiguous=False, 2025-05-07T20:32:37.1367225Z compiled=False, 2025-05-07T20:32:37.1367297Z ) 2025-05-07T20:32:37.1367513Z self = 2025-05-07T20:32:37.1367688Z T = 16384, D = 5120, scale_ub = None, contiguous = False, compiled = False 2025-05-07T20:32:37.1367694Z 2025-05-07T20:32:37.1367772Z @given( 2025-05-07T20:32:37.1367888Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:37.1368024Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:37.1368142Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:37.1368256Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:37.1368367Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:37.1368441Z ) 2025-05-07T20:32:37.1368680Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:37.1368775Z def test_silu_mul_quant( 2025-05-07T20:32:37.1368853Z self, 2025-05-07T20:32:37.1368927Z T: int, 2025-05-07T20:32:37.1369004Z D: int, 2025-05-07T20:32:37.1369100Z scale_ub: Optional[float], 2025-05-07T20:32:37.1369187Z contiguous: bool, 2025-05-07T20:32:37.1369318Z compiled: bool, 2025-05-07T20:32:37.1369396Z ) -> None: 2025-05-07T20:32:37.1369490Z torch.manual_seed(2025) 2025-05-07T20:32:37.1369565Z 2025-05-07T20:32:37.1369728Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:37.1369802Z 2025-05-07T20:32:37.1369896Z x_sign = torch.sign(x) 2025-05-07T20:32:37.1370019Z > x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:37.1371784Z E torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 320.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 140.44 MiB is free. Including non-PyTorch memory, this process has 21.92 GiB memory in use. Of the allocated memory 21.60 GiB is allocated by PyTorch, and 45.02 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. 
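Every CompilationError repeated above shares one root cause: the Triton kernel behind silu_mul_quant quantizes to fp8e4nv (float8_e4m3fn), and Triton only lowers that dtype on GPUs with compute capability 8.9 or newer (Ada/Hopper class parts); on older architectures it raises exactly this error, offering only fp8e4b15 and fp8e5. The OutOfMemoryError immediately above is a separate issue, discussed below. A minimal sketch of a hardware guard, assuming it is acceptable to skip these tests on unsupported GPUs (the helper name is hypothetical, not existing test infrastructure):

    import unittest
    import torch

    def _supports_fp8e4nv() -> bool:
        # Triton lowers fp8e4nv (float8_e4m3fn) only on SM 8.9+ (Ada/Hopper);
        # older parts raise the "not supported in this architecture" error above.
        return torch.cuda.is_available() and torch.cuda.get_device_capability() >= (8, 9)

    # Hypothetical usage on the failing test:
    # @unittest.skipUnless(_supports_fp8e4nv(), "fp8e4nv requires SM 8.9+")
    # def test_silu_mul_quant(self, ...) -> None: ...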
See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables) 2025-05-07T20:32:37.1371793Z 2025-05-07T20:32:37.1371948Z moe/activation_test.py:95: OutOfMemoryError 2025-05-07T20:32:37.1371953Z 2025-05-07T20:32:37.1372057Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:37.1372279Z self=, 2025-05-07T20:32:37.1372353Z T=4096, 2025-05-07T20:32:37.1372431Z D=7168, 2025-05-07T20:32:37.1372512Z scale_ub=1200.0, 2025-05-07T20:32:37.1372594Z contiguous=True, 2025-05-07T20:32:37.1372677Z compiled=True, 2025-05-07T20:32:37.1372748Z ) 2025-05-07T20:32:37.1372960Z self = 2025-05-07T20:32:37.1373132Z T = 4096, D = 7168, scale_ub = 1200.0, contiguous = True, compiled = True 2025-05-07T20:32:37.1373136Z 2025-05-07T20:32:37.1373212Z @given( 2025-05-07T20:32:37.1373329Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:37.1373432Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:37.1373546Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:37.1373748Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:37.1373860Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:37.1373937Z ) 2025-05-07T20:32:37.1374180Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:37.1374271Z def test_silu_mul_quant( 2025-05-07T20:32:37.1374346Z self, 2025-05-07T20:32:37.1374424Z T: int, 2025-05-07T20:32:37.1374500Z D: int, 2025-05-07T20:32:37.1374596Z scale_ub: Optional[float], 2025-05-07T20:32:37.1374686Z contiguous: bool, 2025-05-07T20:32:37.1374819Z compiled: bool, 2025-05-07T20:32:37.1374899Z ) -> None: 2025-05-07T20:32:37.1374992Z torch.manual_seed(2025) 2025-05-07T20:32:37.1375062Z 2025-05-07T20:32:37.1375230Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:37.1375305Z 2025-05-07T20:32:37.1375399Z x_sign = torch.sign(x) 2025-05-07T20:32:37.1375538Z > x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:37.1377307Z E torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 112.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 28.44 MiB is free. Including non-PyTorch memory, this process has 22.03 GiB memory in use. Of the allocated memory 21.61 GiB is allocated by PyTorch, and 141.02 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. 
See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables) 2025-05-07T20:32:37.1377355Z 2025-05-07T20:32:37.1377474Z moe/activation_test.py:95: OutOfMemoryError 2025-05-07T20:32:37.1377478Z 2025-05-07T20:32:37.1377578Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:37.1377833Z self=, 2025-05-07T20:32:37.1377919Z T=16384, 2025-05-07T20:32:37.1377994Z D=7168, 2025-05-07T20:32:37.1378076Z scale_ub=None, 2025-05-07T20:32:37.1378160Z contiguous=False, 2025-05-07T20:32:37.1378244Z compiled=False, 2025-05-07T20:32:37.1378320Z ) 2025-05-07T20:32:37.1378534Z self = 2025-05-07T20:32:37.1378704Z T = 16384, D = 7168, scale_ub = None, contiguous = False, compiled = False 2025-05-07T20:32:37.1378709Z 2025-05-07T20:32:37.1378786Z @given( 2025-05-07T20:32:37.1378901Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:37.1378998Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:37.1379117Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:37.1379230Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:37.1379344Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:37.1379416Z ) 2025-05-07T20:32:37.1379700Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:37.1379797Z def test_silu_mul_quant( 2025-05-07T20:32:37.1379872Z self, 2025-05-07T20:32:37.1379947Z T: int, 2025-05-07T20:32:37.1380030Z D: int, 2025-05-07T20:32:37.1380128Z scale_ub: Optional[float], 2025-05-07T20:32:37.1380214Z contiguous: bool, 2025-05-07T20:32:37.1380301Z compiled: bool, 2025-05-07T20:32:37.1380376Z ) -> None: 2025-05-07T20:32:37.1380468Z torch.manual_seed(2025) 2025-05-07T20:32:37.1380543Z 2025-05-07T20:32:37.1380707Z > x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:37.1382452Z E torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 448.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 140.44 MiB is free. Including non-PyTorch memory, this process has 21.92 GiB memory in use. Of the allocated memory 21.50 GiB is allocated by PyTorch, and 141.02 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. 
See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables) 2025-05-07T20:32:37.1382460Z 2025-05-07T20:32:37.1382575Z moe/activation_test.py:92: OutOfMemoryError 2025-05-07T20:32:37.1382579Z 2025-05-07T20:32:37.1382681Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:37.1382899Z self=, 2025-05-07T20:32:37.1382974Z T=2048, 2025-05-07T20:32:37.1383054Z D=7168, 2025-05-07T20:32:37.1383178Z scale_ub=1200.0, 2025-05-07T20:32:37.1383260Z contiguous=True, 2025-05-07T20:32:37.1383342Z compiled=True, 2025-05-07T20:32:37.1383413Z ) 2025-05-07T20:32:37.1383624Z self = 2025-05-07T20:32:37.1383795Z T = 2048, D = 7168, scale_ub = 1200.0, contiguous = True, compiled = True 2025-05-07T20:32:37.1383799Z 2025-05-07T20:32:37.1383874Z @given( 2025-05-07T20:32:37.1383992Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:37.1384087Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:37.1384240Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:37.1384356Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:37.1384466Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:37.1384539Z ) 2025-05-07T20:32:37.1384780Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:37.1384871Z def test_silu_mul_quant( 2025-05-07T20:32:37.1384947Z self, 2025-05-07T20:32:37.1385026Z T: int, 2025-05-07T20:32:37.1385101Z D: int, 2025-05-07T20:32:37.1385198Z scale_ub: Optional[float], 2025-05-07T20:32:37.1385284Z contiguous: bool, 2025-05-07T20:32:37.1385371Z compiled: bool, 2025-05-07T20:32:37.1385534Z ) -> None: 2025-05-07T20:32:37.1385651Z torch.manual_seed(2025) 2025-05-07T20:32:37.1385725Z 2025-05-07T20:32:37.1385890Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:37.1385964Z 2025-05-07T20:32:37.1386054Z x_sign = torch.sign(x) 2025-05-07T20:32:37.1386179Z > x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:37.1387896Z E torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 56.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 28.44 MiB is free. Including non-PyTorch memory, this process has 22.03 GiB memory in use. Of the allocated memory 21.67 GiB is allocated by PyTorch, and 85.02 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. 
See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables) 2025-05-07T20:32:37.1388759Z 2025-05-07T20:32:37.1388916Z moe/activation_test.py:95: OutOfMemoryError 2025-05-07T20:32:37.1388921Z 2025-05-07T20:32:37.1389021Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:37.1389239Z self=, 2025-05-07T20:32:37.1389321Z T=2048, 2025-05-07T20:32:37.1389398Z D=7168, 2025-05-07T20:32:37.1389480Z scale_ub=None, 2025-05-07T20:32:37.1389563Z contiguous=True, 2025-05-07T20:32:37.1389644Z compiled=False, 2025-05-07T20:32:37.1389717Z ) 2025-05-07T20:32:37.1389927Z self = 2025-05-07T20:32:37.1390091Z T = 2048, D = 7168, scale_ub = None, contiguous = True, compiled = False 2025-05-07T20:32:37.1390098Z 2025-05-07T20:32:37.1390178Z @given( 2025-05-07T20:32:37.1390292Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:37.1390390Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:37.1390508Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:37.1390625Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:37.1390739Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:37.1390813Z ) 2025-05-07T20:32:37.1391053Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:37.1391146Z def test_silu_mul_quant( 2025-05-07T20:32:37.1391220Z self, 2025-05-07T20:32:37.1391295Z T: int, 2025-05-07T20:32:37.1391373Z D: int, 2025-05-07T20:32:37.1391469Z scale_ub: Optional[float], 2025-05-07T20:32:37.1391555Z contiguous: bool, 2025-05-07T20:32:37.1391640Z compiled: bool, 2025-05-07T20:32:37.1391763Z ) -> None: 2025-05-07T20:32:37.1391855Z torch.manual_seed(2025) 2025-05-07T20:32:37.1391930Z 2025-05-07T20:32:37.1392092Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:37.1392167Z 2025-05-07T20:32:37.1392259Z > x_sign = torch.sign(x) 2025-05-07T20:32:37.1393975Z E torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 56.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 28.44 MiB is free. Including non-PyTorch memory, this process has 22.03 GiB memory in use. Of the allocated memory 21.67 GiB is allocated by PyTorch, and 85.02 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. 
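The OutOfMemoryError failures, by contrast, come from example size plus allocator state rather than from the kernel under test. Each Hypothesis example materializes several [T, 2*D] bfloat16 temporaries: for T=16384, D=7168 the initial torch.randn alone is 16384 * 14336 * 2 bytes = 448 MiB, which matches the failed 448.00 MiB request above, and the sign/abs/clamp/multiply steps add comparable temporaries on top. Once the large examples have run, even 40-56 MiB requests fail because the ~22 GiB device is nearly full. A minimal mitigation sketch, assuming it is acceptable to reset the caching allocator at the start of each example (helper name hypothetical):

    import gc
    import torch

    def _release_cuda_memory() -> None:
        # Drop unreachable tensors, then return cached blocks to the driver,
        # so each Hypothesis example starts from a clean allocator state.
        gc.collect()
        torch.cuda.empty_cache()

Because the @given-decorated body runs once per generated example, calling _release_cuda_memory() at the top of test_silu_mul_quant would take effect between examples; unittest's setUp/tearDown would not, since they run only once around the whole example sequence.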
See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables) 2025-05-07T20:32:37.1394025Z 2025-05-07T20:32:37.1394139Z moe/activation_test.py:94: OutOfMemoryError 2025-05-07T20:32:37.1394146Z 2025-05-07T20:32:37.1394245Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:37.1394464Z self=, 2025-05-07T20:32:37.1394539Z T=1, 2025-05-07T20:32:37.1394613Z D=7168, 2025-05-07T20:32:37.1394735Z scale_ub=1200.0, 2025-05-07T20:32:37.1394821Z contiguous=True, 2025-05-07T20:32:37.1394902Z compiled=False, 2025-05-07T20:32:37.1394975Z ) 2025-05-07T20:32:37.1395187Z self = 2025-05-07T20:32:37.1395353Z T = 1, D = 7168, scale_ub = 1200.0, contiguous = True, compiled = False 2025-05-07T20:32:37.1395357Z 2025-05-07T20:32:37.1395447Z @given( 2025-05-07T20:32:37.1395575Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:37.1395694Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:37.1395805Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:37.1395921Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:37.1396035Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:37.1396106Z ) 2025-05-07T20:32:37.1396350Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:37.1396444Z def test_silu_mul_quant( 2025-05-07T20:32:37.1396520Z self, 2025-05-07T20:32:37.1396639Z T: int, 2025-05-07T20:32:37.1396716Z D: int, 2025-05-07T20:32:37.1396814Z scale_ub: Optional[float], 2025-05-07T20:32:37.1396905Z contiguous: bool, 2025-05-07T20:32:37.1396988Z compiled: bool, 2025-05-07T20:32:37.1397066Z ) -> None: 2025-05-07T20:32:37.1397161Z torch.manual_seed(2025) 2025-05-07T20:32:37.1397232Z 2025-05-07T20:32:37.1397398Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:37.1397472Z 2025-05-07T20:32:37.1397561Z x_sign = torch.sign(x) 2025-05-07T20:32:37.1397683Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:37.1397776Z x = x_sign * x_clamp 2025-05-07T20:32:37.1397856Z x0 = x[:, :D] 2025-05-07T20:32:37.1397938Z x1 = x[:, D:] 2025-05-07T20:32:37.1398009Z 2025-05-07T20:32:37.1398092Z if contiguous: 2025-05-07T20:32:37.1398346Z x0 = x0.contiguous() 2025-05-07T20:32:37.1398440Z x1 = x1.contiguous() 2025-05-07T20:32:37.1398511Z 2025-05-07T20:32:37.1398602Z if scale_ub is not None: 2025-05-07T20:32:37.1398706Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:37.1398841Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:37.1398917Z ) 2025-05-07T20:32:37.1398992Z else: 2025-05-07T20:32:37.1399084Z scale_ub_tensor = None 2025-05-07T20:32:37.1399158Z 2025-05-07T20:32:37.1399284Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:37.1399375Z op = silu_mul_quant 2025-05-07T20:32:37.1399538Z if compiled: 2025-05-07T20:32:37.1399636Z op = torch.compile(op) 2025-05-07T20:32:37.1399741Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:37.1399813Z 2025-05-07T20:32:37.1399903Z > y_fp8, y_scale = fn() 2025-05-07T20:32:37.1399907Z 2025-05-07T20:32:37.1400009Z moe/activation_test.py:117: 2025-05-07T20:32:37.1400138Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:37.1400237Z moe/activation_test.py:115: in fn 2025-05-07T20:32:37.1400400Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:37.1400892Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:32:37.1400989Z 
_fbgemm_silu_mul_quant[grid]( 2025-05-07T20:32:37.1401347Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:37.1401568Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:37.1401905Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:37.1402002Z kernel = self.compile( 2025-05-07T20:32:37.1402438Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:37.1402611Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:37.1402740Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:37.1402747Z 2025-05-07T20:32:37.1402946Z self = 2025-05-07T20:32:37.1403709Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:37.1404204Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f779d7b22a0>} 2025-05-07T20:32:37.1404995Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:37.1405185Z context = 2025-05-07T20:32:37.1405192Z 2025-05-07T20:32:37.1405367Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:37.1405663Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:37.1405773Z module_map=module_map) 2025-05-07T20:32:37.1405933Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:37.1406032Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:32:37.1406110Z E ^ 2025-05-07T20:32:37.1406460Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:37.1406464Z 2025-05-07T20:32:37.1406872Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:37.1406879Z 2025-05-07T20:32:37.1406979Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:37.1407202Z self=, 2025-05-07T20:32:37.1407279Z T=128, 2025-05-07T20:32:37.1407356Z D=5120, 2025-05-07T20:32:37.1407437Z scale_ub=None, 2025-05-07T20:32:37.1407520Z contiguous=True, 2025-05-07T20:32:37.1407603Z compiled=False, 2025-05-07T20:32:37.1407674Z ) 2025-05-07T20:32:37.1407886Z self = 2025-05-07T20:32:37.1408053Z T = 128, D = 5120, scale_ub = None, contiguous = True, compiled = False 2025-05-07T20:32:37.1408102Z 2025-05-07T20:32:37.1408176Z @given( 2025-05-07T20:32:37.1408294Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:37.1408394Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:37.1408509Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:37.1408627Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:37.1408743Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:37.1408814Z ) 2025-05-07T20:32:37.1409058Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:37.1409192Z def test_silu_mul_quant( 2025-05-07T20:32:37.1409266Z self, 2025-05-07T20:32:37.1409343Z T: int, 2025-05-07T20:32:37.1409417Z D: int, 2025-05-07T20:32:37.1409512Z scale_ub: Optional[float], 2025-05-07T20:32:37.1409603Z contiguous: bool, 2025-05-07T20:32:37.1409687Z compiled: bool, 2025-05-07T20:32:37.1409766Z ) -> None: 2025-05-07T20:32:37.1409863Z torch.manual_seed(2025) 2025-05-07T20:32:37.1409934Z 2025-05-07T20:32:37.1410096Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:37.1410169Z 2025-05-07T20:32:37.1410260Z x_sign = torch.sign(x) 2025-05-07T20:32:37.1410426Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:37.1410516Z x = x_sign * x_clamp 2025-05-07T20:32:37.1410595Z x0 = x[:, :D] 2025-05-07T20:32:37.1410675Z x1 = x[:, D:] 2025-05-07T20:32:37.1410748Z 2025-05-07T20:32:37.1410830Z if contiguous: 2025-05-07T20:32:37.1410923Z x0 = x0.contiguous() 2025-05-07T20:32:37.1411012Z x1 = x1.contiguous() 2025-05-07T20:32:37.1411083Z 2025-05-07T20:32:37.1411175Z if scale_ub is not None: 2025-05-07T20:32:37.1411279Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:37.1411413Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:37.1411494Z ) 2025-05-07T20:32:37.1411568Z else: 2025-05-07T20:32:37.1411662Z scale_ub_tensor = None 2025-05-07T20:32:37.1411733Z 2025-05-07T20:32:37.1411859Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:37.1411953Z op = silu_mul_quant 2025-05-07T20:32:37.1412036Z if compiled: 2025-05-07T20:32:37.1412175Z op = torch.compile(op) 2025-05-07T20:32:37.1412281Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:37.1412355Z 2025-05-07T20:32:37.1412444Z > y_fp8, y_scale = fn() 2025-05-07T20:32:37.1412450Z 2025-05-07T20:32:37.1412547Z moe/activation_test.py:117: 2025-05-07T20:32:37.1412673Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:37.1412773Z moe/activation_test.py:115: in fn 2025-05-07T20:32:37.1412870Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:37.1413357Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:32:37.1413457Z 
_fbgemm_silu_mul_quant[grid]( 2025-05-07T20:32:37.1413897Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:37.1414122Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:37.1414460Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:37.1414553Z kernel = self.compile( 2025-05-07T20:32:37.1414933Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:37.1415104Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:37.1415230Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:37.1415235Z 2025-05-07T20:32:37.1415437Z self = 2025-05-07T20:32:37.1416243Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:37.1416742Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f779d7b31a0>} 2025-05-07T20:32:37.1417598Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:37.1417785Z context = 2025-05-07T20:32:37.1417793Z 2025-05-07T20:32:37.1417955Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:37.1418212Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:37.1418320Z module_map=module_map) 2025-05-07T20:32:37.1418478Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:37.1418613Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:32:37.1418694Z E ^ 2025-05-07T20:32:37.1419042Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:37.1419050Z 2025-05-07T20:32:37.1419456Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:37.1419460Z 2025-05-07T20:32:37.1419563Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:37.1419780Z self=, 2025-05-07T20:32:37.1419857Z T=128, 2025-05-07T20:32:37.1419931Z D=7168, 2025-05-07T20:32:37.1420013Z scale_ub=None, 2025-05-07T20:32:37.1420098Z contiguous=True, 2025-05-07T20:32:37.1420182Z compiled=False, 2025-05-07T20:32:37.1420254Z ) 2025-05-07T20:32:37.1420472Z self = 2025-05-07T20:32:37.1420638Z T = 128, D = 7168, scale_ub = None, contiguous = True, compiled = False 2025-05-07T20:32:37.1420642Z 2025-05-07T20:32:37.1420759Z @given( 2025-05-07T20:32:37.1420876Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:37.1420973Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:37.1421092Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:37.1421206Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:37.1421318Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:37.1421393Z ) 2025-05-07T20:32:37.1421632Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:37.1421724Z def test_silu_mul_quant( 2025-05-07T20:32:37.1421803Z self, 2025-05-07T20:32:37.1421878Z T: int, 2025-05-07T20:32:37.1421955Z D: int, 2025-05-07T20:32:37.1422051Z scale_ub: Optional[float], 2025-05-07T20:32:37.1422137Z contiguous: bool, 2025-05-07T20:32:37.1422224Z compiled: bool, 2025-05-07T20:32:37.1422302Z ) -> None: 2025-05-07T20:32:37.1422398Z torch.manual_seed(2025) 2025-05-07T20:32:37.1422470Z 2025-05-07T20:32:37.1422634Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:37.1422708Z 2025-05-07T20:32:37.1422800Z x_sign = torch.sign(x) 2025-05-07T20:32:37.1422921Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:37.1423008Z x = x_sign * x_clamp 2025-05-07T20:32:37.1423090Z x0 = x[:, :D] 2025-05-07T20:32:37.1423168Z x1 = x[:, D:] 2025-05-07T20:32:37.1423238Z 2025-05-07T20:32:37.1423322Z if contiguous: 2025-05-07T20:32:37.1423411Z x0 = x0.contiguous() 2025-05-07T20:32:37.1423547Z x1 = x1.contiguous() 2025-05-07T20:32:37.1423617Z 2025-05-07T20:32:37.1423706Z if scale_ub is not None: 2025-05-07T20:32:37.1423811Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:37.1423946Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:37.1424021Z ) 2025-05-07T20:32:37.1424101Z else: 2025-05-07T20:32:37.1424192Z scale_ub_tensor = None 2025-05-07T20:32:37.1424262Z 2025-05-07T20:32:37.1424433Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:37.1424520Z op = silu_mul_quant 2025-05-07T20:32:37.1424603Z if compiled: 2025-05-07T20:32:37.1424704Z op = torch.compile(op) 2025-05-07T20:32:37.1424807Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:37.1424881Z 2025-05-07T20:32:37.1424970Z > y_fp8, y_scale = fn() 2025-05-07T20:32:37.1424975Z 2025-05-07T20:32:37.1425071Z moe/activation_test.py:117: 2025-05-07T20:32:37.1425203Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:37.1425303Z moe/activation_test.py:115: in fn 2025-05-07T20:32:37.1425401Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:37.1425935Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:32:37.1426032Z 
_fbgemm_silu_mul_quant[grid]( 2025-05-07T20:32:37.1426389Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:37.1426609Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:37.1426942Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:37.1427035Z kernel = self.compile( 2025-05-07T20:32:37.1427412Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:37.1427585Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:37.1427714Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:37.1427718Z 2025-05-07T20:32:37.1427959Z self = 2025-05-07T20:32:37.1428721Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:37.1429217Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f779ced4040>} 2025-05-07T20:32:37.1429949Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:37.1430141Z context = 2025-05-07T20:32:37.1430145Z 2025-05-07T20:32:37.1430311Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:37.1430572Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:37.1430678Z module_map=module_map) 2025-05-07T20:32:37.1430843Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:37.1430940Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:32:37.1431014Z E ^ 2025-05-07T20:32:37.1431362Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:37.1431367Z 2025-05-07T20:32:37.1431770Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:37.1431815Z 2025-05-07T20:32:37.1431916Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:37.1432138Z self=, 2025-05-07T20:32:37.1432213Z T=2048, 2025-05-07T20:32:37.1432293Z D=7168, 2025-05-07T20:32:37.1432375Z scale_ub=1200.0, 2025-05-07T20:32:37.1432459Z contiguous=True, 2025-05-07T20:32:37.1432543Z compiled=False, 2025-05-07T20:32:37.1432614Z ) 2025-05-07T20:32:37.1432826Z self = 2025-05-07T20:32:37.1433038Z T = 2048, D = 7168, scale_ub = 1200.0, contiguous = True, compiled = False 2025-05-07T20:32:37.1433043Z 2025-05-07T20:32:37.1433117Z @given( 2025-05-07T20:32:37.1433233Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:37.1433333Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:37.1433446Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:37.1433565Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:37.1433677Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:37.1433749Z ) 2025-05-07T20:32:37.1433993Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:37.1434124Z def test_silu_mul_quant( 2025-05-07T20:32:37.1434206Z self, 2025-05-07T20:32:37.1434285Z T: int, 2025-05-07T20:32:37.1434360Z D: int, 2025-05-07T20:32:37.1434455Z scale_ub: Optional[float], 2025-05-07T20:32:37.1434548Z contiguous: bool, 2025-05-07T20:32:37.1434633Z compiled: bool, 2025-05-07T20:32:37.1434709Z ) -> None: 2025-05-07T20:32:37.1434806Z torch.manual_seed(2025) 2025-05-07T20:32:37.1434876Z 2025-05-07T20:32:37.1435044Z > x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:37.1436871Z E torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 56.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 26.44 MiB is free. Including non-PyTorch memory, this process has 22.04 GiB memory in use. Of the allocated memory 21.69 GiB is allocated by PyTorch, and 59.18 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. 
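Peak memory per example could also be reduced inside the test itself: the preprocessing shown above allocates full-size tensors for sign(x), abs(x), the clamp result, and the final product before slicing. A sketch of an equivalent in-place formulation that keeps a single extra temporary (an illustrative rewrite, not the test's current code; shapes chosen to mirror one of the failing examples):

    import torch

    x = torch.randn([2048, 2 * 7168], device="cuda", dtype=torch.bfloat16)
    s = torch.sign(x)                    # the one remaining temporary
    x.abs_().clamp_(0.01, 2.0).mul_(s)   # equals sign(x) * clamp(|x|, 0.01, 2.0)
    del s
    x0, x1 = x[:, :7168], x[:, 7168:]    # same slicing as the test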
See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables) 2025-05-07T20:32:37.1436880Z 2025-05-07T20:32:37.1437000Z moe/activation_test.py:92: OutOfMemoryError 2025-05-07T20:32:37.1437007Z 2025-05-07T20:32:37.1437106Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:37.1437325Z self=, 2025-05-07T20:32:37.1437402Z T=1, 2025-05-07T20:32:37.1437478Z D=5120, 2025-05-07T20:32:37.1437558Z scale_ub=1200.0, 2025-05-07T20:32:37.1437642Z contiguous=True, 2025-05-07T20:32:37.1437723Z compiled=False, 2025-05-07T20:32:37.1437796Z ) 2025-05-07T20:32:37.1438011Z self = 2025-05-07T20:32:37.1438170Z T = 1, D = 5120, scale_ub = 1200.0, contiguous = True, compiled = False 2025-05-07T20:32:37.1438174Z 2025-05-07T20:32:37.1438250Z @given( 2025-05-07T20:32:37.1438371Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:37.1438469Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:37.1438585Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:37.1438701Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:37.1438811Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:37.1438886Z ) 2025-05-07T20:32:37.1439124Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:37.1439218Z def test_silu_mul_quant( 2025-05-07T20:32:37.1439292Z self, 2025-05-07T20:32:37.1439368Z T: int, 2025-05-07T20:32:37.1439490Z D: int, 2025-05-07T20:32:37.1439587Z scale_ub: Optional[float], 2025-05-07T20:32:37.1439673Z contiguous: bool, 2025-05-07T20:32:37.1439760Z compiled: bool, 2025-05-07T20:32:37.1439835Z ) -> None: 2025-05-07T20:32:37.1439927Z torch.manual_seed(2025) 2025-05-07T20:32:37.1440003Z 2025-05-07T20:32:37.1440169Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:37.1440240Z 2025-05-07T20:32:37.1440333Z x_sign = torch.sign(x) 2025-05-07T20:32:37.1440455Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:37.1440584Z x = x_sign * x_clamp 2025-05-07T20:32:37.1440669Z x0 = x[:, :D] 2025-05-07T20:32:37.1440748Z x1 = x[:, D:] 2025-05-07T20:32:37.1440822Z 2025-05-07T20:32:37.1440904Z if contiguous: 2025-05-07T20:32:37.1440993Z x0 = x0.contiguous() 2025-05-07T20:32:37.1441086Z x1 = x1.contiguous() 2025-05-07T20:32:37.1441155Z 2025-05-07T20:32:37.1441247Z if scale_ub is not None: 2025-05-07T20:32:37.1441352Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:37.1441487Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:37.1441561Z ) 2025-05-07T20:32:37.1441639Z else: 2025-05-07T20:32:37.1441772Z scale_ub_tensor = None 2025-05-07T20:32:37.1441843Z 2025-05-07T20:32:37.1441976Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:37.1442064Z op = silu_mul_quant 2025-05-07T20:32:37.1442151Z if compiled: 2025-05-07T20:32:37.1442249Z op = torch.compile(op) 2025-05-07T20:32:37.1442352Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:37.1442426Z 2025-05-07T20:32:37.1442515Z > y_fp8, y_scale = fn() 2025-05-07T20:32:37.1442519Z 2025-05-07T20:32:37.1442613Z moe/activation_test.py:117: 2025-05-07T20:32:37.1442744Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:37.1442844Z moe/activation_test.py:115: in fn 2025-05-07T20:32:37.1442941Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:37.1443432Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:32:37.1443529Z 
_fbgemm_silu_mul_quant[grid]( 2025-05-07T20:32:37.1443932Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:37.1444152Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:37.1444488Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:37.1444582Z kernel = self.compile( 2025-05-07T20:32:37.1444959Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:37.1445131Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:37.1445263Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:37.1445267Z 2025-05-07T20:32:37.1445492Z self = 2025-05-07T20:32:37.1446283Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:37.1446780Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f779ced5580>} 2025-05-07T20:32:37.1447515Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:37.1447769Z context = 2025-05-07T20:32:37.1447774Z 2025-05-07T20:32:37.1447936Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:37.1448197Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:37.1448305Z module_map=module_map) 2025-05-07T20:32:37.1448467Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:37.1448562Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:32:37.1448679Z E ^ 2025-05-07T20:32:37.1449025Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:37.1449030Z 2025-05-07T20:32:37.1449434Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:37.1449438Z 2025-05-07T20:32:37.1449545Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:37.1449768Z self=, 2025-05-07T20:32:37.1449843Z T=2048, 2025-05-07T20:32:37.1449919Z D=5120, 2025-05-07T20:32:37.1449999Z scale_ub=None, 2025-05-07T20:32:37.1450082Z contiguous=True, 2025-05-07T20:32:37.1450205Z compiled=False, 2025-05-07T20:32:37.1450277Z ) 2025-05-07T20:32:37.1450493Z self = 2025-05-07T20:32:37.1450665Z T = 2048, D = 5120, scale_ub = None, contiguous = True, compiled = False 2025-05-07T20:32:37.1450672Z 2025-05-07T20:32:37.1450747Z @given( 2025-05-07T20:32:37.1450863Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:37.1450963Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:37.1451075Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:37.1451197Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:37.1451307Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:37.1451381Z ) 2025-05-07T20:32:37.1451623Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:37.1451715Z def test_silu_mul_quant( 2025-05-07T20:32:37.1451792Z self, 2025-05-07T20:32:37.1451871Z T: int, 2025-05-07T20:32:37.1451948Z D: int, 2025-05-07T20:32:37.1452086Z scale_ub: Optional[float], 2025-05-07T20:32:37.1452176Z contiguous: bool, 2025-05-07T20:32:37.1452259Z compiled: bool, 2025-05-07T20:32:37.1452338Z ) -> None: 2025-05-07T20:32:37.1452433Z torch.manual_seed(2025) 2025-05-07T20:32:37.1452503Z 2025-05-07T20:32:37.1452669Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:37.1452740Z 2025-05-07T20:32:37.1452829Z > x_sign = torch.sign(x) 2025-05-07T20:32:37.1454667Z E torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 40.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 26.44 MiB is free. Including non-PyTorch memory, this process has 22.04 GiB memory in use. Of the allocated memory 21.73 GiB is allocated by PyTorch, and 19.12 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. 
See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables)

moe/activation_test.py:94: OutOfMemoryError

[... ten further Hypothesis examples elided; each failed with torch.OutOfMemoryError at `x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16)` (moe/activation_test.py:92) while only 26.44 MiB of the GPU's 22.07 GiB remained free:
      T=16384, D=5120, scale_ub=None,   contiguous=True,  compiled=False  -- tried to allocate 320.00 MiB
      T=4096,  D=5120, scale_ub=None,   contiguous=True,  compiled=False  -- tried to allocate  80.00 MiB
      T=2048,  D=5120, scale_ub=None,   contiguous=False, compiled=False  -- tried to allocate  40.00 MiB
      T=4096,  D=7168, scale_ub=None,   contiguous=True,  compiled=True   -- tried to allocate 112.00 MiB
      T=2048,  D=5120, scale_ub=1200.0, contiguous=False, compiled=False  -- tried to allocate  40.00 MiB
      T=4096,  D=7168, scale_ub=1200.0, contiguous=True,  compiled=False  -- tried to allocate 112.00 MiB
      T=16384, D=7168, scale_ub=None,   contiguous=False, compiled=True   -- tried to allocate 448.00 MiB
      T=4096,  D=7168, scale_ub=None,   contiguous=True,  compiled=False  -- tried to allocate 112.00 MiB
      T=16384, D=7168, scale_ub=None,   contiguous=True,  compiled=False  -- tried to allocate 448.00 MiB
      T=16384, D=7168, scale_ub=1200.0, contiguous=True,  compiled=False  -- tried to allocate 448.00 MiB
each message ending with: If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables) ...]
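Every OOM message above points at the same allocator knob, PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True, which has to be in the environment before the process makes its first CUDA allocation. A minimal sketch of how a rerun could opt in (the variable is the documented PyTorch setting quoted verbatim in the errors; the tensor shape below just mirrors the largest failing example):

    # Minimal sketch: the env var must be set before torch first touches CUDA.
    import os

    os.environ.setdefault("PYTORCH_CUDA_ALLOC_CONF", "expandable_segments:True")

    import torch  # imported after setting the variable so the allocator sees it

    x = torch.randn([16384, 2 * 7168], device="cuda", dtype=torch.bfloat16)
    print(torch.cuda.memory_summary())  # inspect the reserved vs. allocated split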
Trying example: test_silu_mul_quant(
    T=128,
    D=5120,
    scale_ub=1200.0,
    contiguous=False,
    compiled=False,
)

[... test body identical to the full listing above ...]

>       y_fp8, y_scale = fn()

moe/activation_test.py:117:
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 
moe/activation_test.py:115: in fn
    return op(x0, x1, scale_ub_tensor)
/home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant
    _fbgemm_silu_mul_quant[grid](
[... Triton jit.py/compiler.py frames and CUDAOptions dump identical to the traceback above ...]
E   triton.compiler.errors.CompilationError: at 1:0:
E   def _fbgemm_silu_mul_quant(
E   ^
E   ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')")
/home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:100: CompilationError

Trying example: test_silu_mul_quant(
    T=2048,
    D=7168,
    scale_ub=None,
    contiguous=False,
    compiled=False,
)

[... test body identical to the full listing above ...]

>       x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16)
E       torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 56.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 26.44 MiB is free. Including non-PyTorch memory, this process has 22.04 GiB memory in use. Of the allocated memory 21.74 GiB is allocated by PyTorch, and 10.99 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables)
moe/activation_test.py:92: OutOfMemoryError

Trying example: test_silu_mul_quant(
    T=128,
    D=7168,
    scale_ub=1200.0,
    contiguous=True,
    compiled=True,
)

[... test body identical to the full listing above ...]

>       y_fp8, y_scale = fn()

moe/activation_test.py:117:
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 
moe/activation_test.py:115: in fn
    return op(x0, x1, scale_ub_tensor)
/home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/torch/_dynamo/eval_frame.py:678: in _fn
    return fn(*args, **kwargs)
/home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant
    _fbgemm_silu_mul_quant[grid](
[... Triton jit.py/compiler.py frames and CUDAOptions dump identical to the traceback above ...]
E   triton.compiler.errors.CompilationError: at 1:0:
E   def _fbgemm_silu_mul_quant(
E   ^
E   ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')")

/home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:100: CompilationError
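Every CompilationError in this run has the same root cause: Triton only lowers the fp8e4nv (FP8 E4M3) dtype on GPUs with compute capability 8.9 or newer, and the A10G in this linux.g5.4xlarge runner reports 8.6, leaving only fp8e4b15 and fp8e5 available, exactly as the ValueError lists. A hedged sketch of a capability gate that could skip these examples up front (the helper and class names are illustrative, not from activation_test.py):

    # Sketch: skip fp8e4nv (E4M3) tests on pre-SM-8.9 GPUs such as the A10G.
    import unittest

    import torch

    def supports_fp8e4nv() -> bool:
        # Triton requires compute capability >= (8, 9) to lower fp8e4nv.
        return torch.cuda.is_available() and torch.cuda.get_device_capability() >= (8, 9)

    @unittest.skipUnless(supports_fp8e4nv(), "fp8e4nv needs SM 8.9+ (Ada/Hopper)")
    class ActivationFp8Gate(unittest.TestCase):  # illustrative name
        ...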
Trying example: test_silu_mul_quant(
    T=128,
    D=7168,
    scale_ub=1200.0,
    contiguous=True,
    compiled=False,
)

[... test body identical to the full listing above ...]

        x_sign = torch.sign(x)
>       x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0)
E       torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 20.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 4.44 MiB is free. Including non-PyTorch memory, this process has 22.06 GiB memory in use. Of the allocated memory 21.77 GiB is allocated by PyTorch, and 6.37 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables)
moe/activation_test.py:95: OutOfMemoryError

Trying example: test_silu_mul_quant(
    T=128,
    D=5120,
    scale_ub=1200.0,
    contiguous=True,
    compiled=True,
)

[... test body identical to the full listing above ...]

        x_sign = torch.sign(x)
>       x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0)
E       torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 20.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 4.44 MiB is free. Including non-PyTorch memory, this process has 22.06 GiB memory in use. Of the allocated memory 21.77 GiB is allocated by PyTorch, and 3.87 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables)

moe/activation_test.py:95: OutOfMemoryError
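Free GPU memory shrinks as the run proceeds, from 26.44 MiB earlier in this stretch to 4.44 MiB here, so each new example fails on an ever-smaller allocation: tensors from earlier examples are evidently still alive. A sketch of a cleanup that could run between Hypothesis examples (the wrapper is illustrative and not part of the test file shown above):

    # Sketch: release per-example CUDA memory so later examples start clean.
    import gc

    import torch

    def run_isolated(example_fn) -> None:
        try:
            example_fn()
        finally:
            gc.collect()              # drop tensors only reachable from dead frames
            torch.cuda.empty_cache()  # hand cached, unused blocks back to the driver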
Trying example: test_silu_mul_quant(
    T=128,
    D=7168,
    scale_ub=None,
    contiguous=True,
    compiled=True,
)

[... test body identical to the full listing above ...]

>       x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16)
E       torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 20.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 4.44 MiB is free. Including non-PyTorch memory, this process has 22.06 GiB memory in use. Of the allocated memory 21.77 GiB is allocated by PyTorch, and 3.87 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables)

moe/activation_test.py:92: OutOfMemoryError
=============================== warnings summary ===============================
../../../../../../../../miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/autotuner.py:108
../../../../../../../../miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/autotuner.py:108
../../../../../../../../miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/autotuner.py:108
  /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/autotuner.py:108: DeprecationWarning: warmup, rep, and use_cuda_graph parameters are deprecated. See https://github.com/triton-lang/triton/pull/4496 for details.
    warnings.warn(("warmup, rep, and use_cuda_graph parameters are deprecated. 
See " 2025-05-07T20:32:37.1558099Z 2025-05-07T20:32:37.1558343Z -- Docs: https://docs.pytest.org/en/stable/how-to/capture-warnings.html 2025-05-07T20:32:37.1558509Z ================= 1 failed, 1 deselected, 3 warnings in 12.04s ================= 2025-05-07T20:32:38.7671213Z ERROR conda.cli.main_run:execute(125): `conda run python -m pytest -v -rsx -s -W ignore::pytest.PytestCollectionWarning --lf --last-failed-no-failures none ./moe/activation_test.py` failed. (See above for error) 2025-05-07T20:32:38.8302555Z [EXEC] [ATTEMPT 1/2] Command attempt failed. 2025-05-07T20:32:38.8302806Z 2025-05-07T20:32:40.8320823Z [EXEC] [ATTEMPT 2/2] + conda run --no-capture-output -n build_binary python -m pytest -v -rsx -s -W ignore::pytest.PytestCollectionWarning --lf --last-failed-no-failures none ./moe/activation_test.py 2025-05-07T20:32:43.0063291Z ============================= test session starts ============================== 2025-05-07T20:32:43.0064275Z platform linux -- Python 3.13.0, pytest-8.3.5, pluggy-1.5.0 -- /home/ec2-user/miniconda/envs/build_binary/bin/python 2025-05-07T20:32:43.0065168Z cachedir: .pytest_cache 2025-05-07T20:32:43.0066449Z hypothesis profile 'ci' -> database=None, deadline=None, print_blob=True, derandomize=True, suppress_health_check=(HealthCheck.too_slow,) 2025-05-07T20:32:43.0067664Z rootdir: /home/ec2-user/actions-runner/_work/FBGEMM/FBGEMM/fbgemm_gpu 2025-05-07T20:32:43.0068362Z plugins: hypothesis-6.131.14 2025-05-07T20:32:44.5497504Z TMA benchmarks will be running with experimental grid constant TMA descriptor. 2025-05-07T20:32:44.6454454Z collecting ... collected 2 items / 1 deselected / 1 selected 2025-05-07T20:32:44.6455253Z run-last-failure: rerun previous 1 failure 2025-05-07T20:32:44.6455684Z 2025-05-07T20:32:46.7481747Z moe/activation_test.py::ActivationTests::test_silu_mul_quant Trying example: test_silu_mul_quant( 2025-05-07T20:32:46.7482715Z self=, 2025-05-07T20:32:46.7483225Z T=1, 2025-05-07T20:32:46.7483419Z D=5120, 2025-05-07T20:32:46.7483607Z scale_ub=None, 2025-05-07T20:32:46.7483819Z contiguous=True, 2025-05-07T20:32:46.7484051Z compiled=True, 2025-05-07T20:32:46.7484255Z ) 2025-05-07T20:32:46.7484579Z self = 2025-05-07T20:32:46.7485061Z T = 1, D = 5120, scale_ub = None, contiguous = True, compiled = True 2025-05-07T20:32:46.7485322Z 2025-05-07T20:32:46.7485400Z @given( 2025-05-07T20:32:46.7485638Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:46.7485952Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:46.7486265Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:46.7486585Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:46.7486913Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:46.7487498Z ) 2025-05-07T20:32:46.7487840Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:46.7488328Z def test_silu_mul_quant( 2025-05-07T20:32:46.7488575Z self, 2025-05-07T20:32:46.7488768Z T: int, 2025-05-07T20:32:46.7488969Z D: int, 2025-05-07T20:32:46.7489190Z scale_ub: Optional[float], 2025-05-07T20:32:46.7489456Z contiguous: bool, 2025-05-07T20:32:46.7489701Z compiled: bool, 2025-05-07T20:32:46.7490034Z ) -> None: 2025-05-07T20:32:46.7490243Z torch.manual_seed(2025) 2025-05-07T20:32:46.7490489Z 2025-05-07T20:32:46.7490764Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:46.7491095Z 2025-05-07T20:32:46.7491292Z x_sign = torch.sign(x) 2025-05-07T20:32:46.7491582Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 
2025-05-07T20:32:46.7491900Z x = x_sign * x_clamp 2025-05-07T20:32:46.7492139Z x0 = x[:, :D] 2025-05-07T20:32:46.7492356Z x1 = x[:, D:] 2025-05-07T20:32:46.7492568Z 2025-05-07T20:32:46.7492752Z if contiguous: 2025-05-07T20:32:46.7492986Z x0 = x0.contiguous() 2025-05-07T20:32:46.7493249Z x1 = x1.contiguous() 2025-05-07T20:32:46.7493568Z 2025-05-07T20:32:46.7493894Z if scale_ub is not None: 2025-05-07T20:32:46.7494168Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:46.7494495Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:46.7494811Z ) 2025-05-07T20:32:46.7495008Z else: 2025-05-07T20:32:46.7495214Z scale_ub_tensor = None 2025-05-07T20:32:46.7495478Z 2025-05-07T20:32:46.7495733Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:46.7502315Z op = silu_mul_quant 2025-05-07T20:32:46.7502592Z if compiled: 2025-05-07T20:32:46.7502852Z op = torch.compile(op) 2025-05-07T20:32:46.7503159Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:46.7503441Z 2025-05-07T20:32:46.7503642Z y_fp8, y_scale = fn() 2025-05-07T20:32:46.7503936Z y = y_fp8.to(torch.float32) * y_scale[:, None] 2025-05-07T20:32:46.7504230Z 2025-05-07T20:32:46.7504482Z def ref_fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:46.7504957Z x0_fp32 = x0.to(torch.float32) 2025-05-07T20:32:46.7505249Z x1_fp32 = x1.to(torch.float32) 2025-05-07T20:32:46.7505563Z y = x0_fp32 * torch.sigmoid(x0_fp32) * x1_fp32 2025-05-07T20:32:46.7505932Z return triton_quantize_fp8_row(y, scale_ub_tensor) 2025-05-07T20:32:46.7506238Z 2025-05-07T20:32:46.7506442Z > y_fp8_ref, y_scale_ref = ref_fn() 2025-05-07T20:32:46.7506635Z 2025-05-07T20:32:46.7506743Z moe/activation_test.py:126: 2025-05-07T20:32:46.7507036Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:46.7507378Z moe/activation_test.py:124: in ref_fn 2025-05-07T20:32:46.7507712Z return triton_quantize_fp8_row(y, scale_ub_tensor) 2025-05-07T20:32:46.7508498Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/fbgemm_gpu/experimental/gemm/triton_gemm/fp8_gemm.py:2370: in triton_quantize_fp8_row 2025-05-07T20:32:46.7509241Z _kernel_quantize_fp8_row[grid]( 2025-05-07T20:32:46.7509788Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:46.7510466Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:46.7511148Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/autotuner.py:186: in run 2025-05-07T20:32:46.7511861Z timings = {config: self._bench(*args, config=config, **kwargs) for config in pruned_configs} 2025-05-07T20:32:46.7512583Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/autotuner.py:166: in _bench 2025-05-07T20:32:46.7513292Z return self.do_bench(kernel_call, quantiles=(0.5, 0.2, 0.8)) 2025-05-07T20:32:46.7513887Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/testing.py:117: in do_bench 2025-05-07T20:32:46.7514411Z fn() 2025-05-07T20:32:46.7514926Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/autotuner.py:152: in kernel_call 2025-05-07T20:32:46.7515509Z self.fn.run( 2025-05-07T20:32:46.7515969Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:46.7516575Z kernel = self.compile( 2025-05-07T20:32:46.7517121Z 
/home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:46.7517765Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:46.7518171Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:46.7518414Z 2025-05-07T20:32:46.7518622Z self = 2025-05-07T20:32:46.7519760Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=2, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:46.7521129Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7fe9c614a700>} 2025-05-07T20:32:46.7522458Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:46.7523470Z context = 2025-05-07T20:32:46.7523763Z 2025-05-07T20:32:46.7523931Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:46.7524459Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:46.7524921Z module_map=module_map) 2025-05-07T20:32:46.7525290Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:46.7525698Z E def _kernel_quantize_fp8_row( 2025-05-07T20:32:46.7525972Z E ^ 2025-05-07T20:32:46.7526426Z E ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:46.7526876Z 2025-05-07T20:32:46.7527285Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:46.7527800Z 2025-05-07T20:32:46.7527907Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:46.7528316Z self=, 2025-05-07T20:32:46.7528710Z T=2048, 2025-05-07T20:32:46.7528903Z D=5120, 2025-05-07T20:32:46.7529105Z scale_ub=1200.0, 2025-05-07T20:32:46.7529327Z contiguous=True, 2025-05-07T20:32:46.7529555Z compiled=False, 2025-05-07T20:32:46.7529767Z ) 2025-05-07T20:32:46.7530084Z self = 2025-05-07T20:32:46.7530581Z T = 2048, D = 5120, scale_ub = 1200.0, contiguous = True, compiled = False 2025-05-07T20:32:46.7530851Z 2025-05-07T20:32:46.7530936Z @given( 2025-05-07T20:32:46.7531168Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:46.7531476Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:46.7531779Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:46.7532109Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:46.7532430Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:46.7532718Z ) 2025-05-07T20:32:46.7533071Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:46.7533561Z def test_silu_mul_quant( 2025-05-07T20:32:46.7533910Z self, 2025-05-07T20:32:46.7534111Z T: int, 2025-05-07T20:32:46.7534305Z D: int, 2025-05-07T20:32:46.7534524Z scale_ub: Optional[float], 2025-05-07T20:32:46.7534797Z contiguous: bool, 2025-05-07T20:32:46.7535036Z compiled: bool, 2025-05-07T20:32:46.7535262Z ) -> None: 2025-05-07T20:32:46.7535477Z torch.manual_seed(2025) 2025-05-07T20:32:46.7535716Z 2025-05-07T20:32:46.7536039Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:46.7536378Z 2025-05-07T20:32:46.7536572Z x_sign = torch.sign(x) 2025-05-07T20:32:46.7536857Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:46.7537169Z x = x_sign * x_clamp 2025-05-07T20:32:46.7537413Z x0 = x[:, :D] 
2025-05-07T20:32:46.7537629Z x1 = x[:, D:] 2025-05-07T20:32:46.7537837Z 2025-05-07T20:32:46.7538030Z if contiguous: 2025-05-07T20:32:46.7538263Z x0 = x0.contiguous() 2025-05-07T20:32:46.7538523Z x1 = x1.contiguous() 2025-05-07T20:32:46.7538765Z 2025-05-07T20:32:46.7538954Z if scale_ub is not None: 2025-05-07T20:32:46.7539278Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:46.7539617Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:46.7539920Z ) 2025-05-07T20:32:46.7540115Z else: 2025-05-07T20:32:46.7540328Z scale_ub_tensor = None 2025-05-07T20:32:46.7540578Z 2025-05-07T20:32:46.7540811Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:46.7541126Z op = silu_mul_quant 2025-05-07T20:32:46.7541384Z if compiled: 2025-05-07T20:32:46.7541627Z op = torch.compile(op) 2025-05-07T20:32:46.7541926Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:46.7542203Z 2025-05-07T20:32:46.7542393Z > y_fp8, y_scale = fn() 2025-05-07T20:32:46.7542565Z 2025-05-07T20:32:46.7542663Z moe/activation_test.py:117: 2025-05-07T20:32:46.7542960Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:46.7543284Z moe/activation_test.py:115: in fn 2025-05-07T20:32:46.7543571Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:46.7544307Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:32:46.7544994Z _fbgemm_silu_mul_quant[grid]( 2025-05-07T20:32:46.7545525Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:46.7546197Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:46.7546858Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:46.7547383Z kernel = self.compile( 2025-05-07T20:32:46.7547924Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:46.7548573Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:46.7548975Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:46.7549203Z 2025-05-07T20:32:46.7549410Z self = 2025-05-07T20:32:46.7550473Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:46.7551828Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7fe9c5ffa020>} 2025-05-07T20:32:46.7553151Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:46.7554232Z context = 2025-05-07T20:32:46.7554531Z 2025-05-07T20:32:46.7554701Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:46.7555221Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:46.7555728Z module_map=module_map) 2025-05-07T20:32:46.7556088Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:46.7556440Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:32:46.7556698Z E ^ 2025-05-07T20:32:46.7557153Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:46.7557591Z 2025-05-07T20:32:46.7557997Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:47.4002478Z 2025-05-07T20:32:47.4003156Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:47.4003789Z self=, 2025-05-07T20:32:47.4004673Z T=2048, 2025-05-07T20:32:47.4004930Z D=5120, 2025-05-07T20:32:47.4005133Z scale_ub=1200.0, 2025-05-07T20:32:47.4005353Z contiguous=True, 2025-05-07T20:32:47.4005578Z compiled=True, 2025-05-07T20:32:47.4005790Z ) 2025-05-07T20:32:47.4006113Z self = 2025-05-07T20:32:47.4006610Z T = 2048, D = 5120, scale_ub = 1200.0, contiguous = True, compiled = True 2025-05-07T20:32:47.4006886Z 2025-05-07T20:32:47.4006966Z @given( 2025-05-07T20:32:47.4007199Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:47.4007506Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:47.4007821Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:47.4008151Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:47.4008469Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:47.4008754Z ) 2025-05-07T20:32:47.4009104Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:47.4009635Z def test_silu_mul_quant( 2025-05-07T20:32:47.4009884Z self, 2025-05-07T20:32:47.4010080Z T: int, 2025-05-07T20:32:47.4010274Z D: int, 2025-05-07T20:32:47.4010491Z scale_ub: Optional[float], 2025-05-07T20:32:47.4010760Z contiguous: bool, 2025-05-07T20:32:47.4011003Z compiled: bool, 2025-05-07T20:32:47.4011228Z ) -> None: 2025-05-07T20:32:47.4011447Z torch.manual_seed(2025) 2025-05-07T20:32:47.4011690Z 2025-05-07T20:32:47.4011956Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:47.4012298Z 2025-05-07T20:32:47.4012492Z x_sign = torch.sign(x) 2025-05-07T20:32:47.4012773Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:47.4013082Z x = x_sign * x_clamp 2025-05-07T20:32:47.4013319Z x0 = x[:, :D] 2025-05-07T20:32:47.4013524Z x1 = x[:, D:] 2025-05-07T20:32:47.4013877Z 2025-05-07T20:32:47.4014061Z if contiguous: 2025-05-07T20:32:47.4014284Z x0 = x0.contiguous() 2025-05-07T20:32:47.4014543Z x1 = x1.contiguous() 2025-05-07T20:32:47.4014782Z 2025-05-07T20:32:47.4014965Z if scale_ub is not None: 2025-05-07T20:32:47.4015234Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:47.4015564Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:47.4015870Z ) 2025-05-07T20:32:47.4016056Z else: 2025-05-07T20:32:47.4016268Z scale_ub_tensor = None 2025-05-07T20:32:47.4016519Z 2025-05-07T20:32:47.4016743Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:47.4017154Z op = silu_mul_quant 2025-05-07T20:32:47.4017404Z if compiled: 2025-05-07T20:32:47.4017681Z op = torch.compile(op) 2025-05-07T20:32:47.4017976Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:47.4018254Z 2025-05-07T20:32:47.4018451Z y_fp8, y_scale = fn() 2025-05-07T20:32:47.4018737Z y = y_fp8.to(torch.float32) * y_scale[:, None] 2025-05-07T20:32:47.4019022Z 2025-05-07T20:32:47.4019256Z def ref_fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:47.4019681Z x0_fp32 = x0.to(torch.float32) 2025-05-07T20:32:47.4019966Z x1_fp32 = x1.to(torch.float32) 2025-05-07T20:32:47.4020279Z y = x0_fp32 * torch.sigmoid(x0_fp32) * x1_fp32 2025-05-07T20:32:47.4020633Z return triton_quantize_fp8_row(y, scale_ub_tensor) 2025-05-07T20:32:47.4020953Z 2025-05-07T20:32:47.4021154Z > 
y_fp8_ref, y_scale_ref = ref_fn() 2025-05-07T20:32:47.4021356Z 2025-05-07T20:32:47.4021460Z moe/activation_test.py:126: 2025-05-07T20:32:47.4021754Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:47.4022082Z moe/activation_test.py:124: in ref_fn 2025-05-07T20:32:47.4022451Z return triton_quantize_fp8_row(y, scale_ub_tensor) 2025-05-07T20:32:47.4023239Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/fbgemm_gpu/experimental/gemm/triton_gemm/fp8_gemm.py:2370: in triton_quantize_fp8_row 2025-05-07T20:32:47.4023978Z _kernel_quantize_fp8_row[grid]( 2025-05-07T20:32:47.4024518Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:47.4025198Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:47.4025874Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/autotuner.py:186: in run 2025-05-07T20:32:47.4026578Z timings = {config: self._bench(*args, config=config, **kwargs) for config in pruned_configs} 2025-05-07T20:32:47.4027297Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/autotuner.py:166: in _bench 2025-05-07T20:32:47.4027922Z return self.do_bench(kernel_call, quantiles=(0.5, 0.2, 0.8)) 2025-05-07T20:32:47.4028571Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/testing.py:117: in do_bench 2025-05-07T20:32:47.4029079Z fn() 2025-05-07T20:32:47.4029580Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/autotuner.py:152: in kernel_call 2025-05-07T20:32:47.4030157Z self.fn.run( 2025-05-07T20:32:47.4030612Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:47.4031135Z kernel = self.compile( 2025-05-07T20:32:47.4031671Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:47.4032316Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:47.4032705Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:47.4032937Z 2025-05-07T20:32:47.4033145Z self = 2025-05-07T20:32:47.4034217Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=2, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:47.4035581Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7fe9c4eeaac0>} 2025-05-07T20:32:47.4036899Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:47.4037952Z context = 2025-05-07T20:32:47.4038241Z 2025-05-07T20:32:47.4038406Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:47.4038969Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:47.4039429Z module_map=module_map) 2025-05-07T20:32:47.4039795Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:47.4040200Z E def _kernel_quantize_fp8_row( 2025-05-07T20:32:47.4040469Z E ^ 2025-05-07T20:32:47.4040918Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:47.4041366Z 2025-05-07T20:32:47.4041774Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:47.4042275Z 2025-05-07T20:32:47.4042392Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:47.4042802Z self=, 2025-05-07T20:32:47.4043193Z T=16384, 2025-05-07T20:32:47.4043392Z D=7168, 2025-05-07T20:32:47.4043587Z scale_ub=1200.0, 2025-05-07T20:32:47.4043853Z contiguous=False, 2025-05-07T20:32:47.4044080Z compiled=False, 2025-05-07T20:32:47.4044291Z ) 2025-05-07T20:32:47.4044598Z self = 2025-05-07T20:32:47.4045093Z T = 16384, D = 7168, scale_ub = 1200.0, contiguous = False, compiled = False 2025-05-07T20:32:47.4045363Z 2025-05-07T20:32:47.4045445Z @given( 2025-05-07T20:32:47.4045668Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:47.4045984Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:47.4046285Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:47.4046608Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:47.4046931Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:47.4047218Z ) 2025-05-07T20:32:47.4047561Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:47.4047990Z def test_silu_mul_quant( 2025-05-07T20:32:47.4048239Z self, 2025-05-07T20:32:47.4048435Z T: int, 2025-05-07T20:32:47.4048677Z D: int, 2025-05-07T20:32:47.4048902Z scale_ub: Optional[float], 2025-05-07T20:32:47.4049169Z contiguous: bool, 2025-05-07T20:32:47.4049402Z compiled: bool, 2025-05-07T20:32:47.4049627Z ) -> None: 2025-05-07T20:32:47.4049841Z torch.manual_seed(2025) 2025-05-07T20:32:47.4050074Z 2025-05-07T20:32:47.4050342Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:47.4050676Z 2025-05-07T20:32:47.4050868Z x_sign = torch.sign(x) 2025-05-07T20:32:47.4051152Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:47.4051461Z x = x_sign * x_clamp 2025-05-07T20:32:47.4051701Z x0 = x[:, :D] 2025-05-07T20:32:47.4051913Z x1 = x[:, D:] 2025-05-07T20:32:47.4052120Z 2025-05-07T20:32:47.4052308Z if contiguous: 2025-05-07T20:32:47.4052531Z x0 = x0.contiguous() 2025-05-07T20:32:47.4052790Z x1 = x1.contiguous() 2025-05-07T20:32:47.4053026Z 2025-05-07T20:32:47.4053214Z if scale_ub is not None: 2025-05-07T20:32:47.4053481Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:47.4053929Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:47.4054268Z ) 2025-05-07T20:32:47.4054471Z else: 2025-05-07T20:32:47.4054692Z scale_ub_tensor = None 2025-05-07T20:32:47.4054959Z 2025-05-07T20:32:47.4055207Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:47.4055555Z op = silu_mul_quant 2025-05-07T20:32:47.4055823Z if compiled: 2025-05-07T20:32:47.4056126Z op = torch.compile(op) 2025-05-07T20:32:47.4056420Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:47.4056696Z 2025-05-07T20:32:47.4056882Z > y_fp8, y_scale = fn() 2025-05-07T20:32:47.4057048Z 2025-05-07T20:32:47.4057142Z moe/activation_test.py:117: 2025-05-07T20:32:47.4057439Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:47.4057766Z moe/activation_test.py:115: in fn 2025-05-07T20:32:47.4058045Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:47.4058765Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 
2025-05-07T20:32:47.4059439Z _fbgemm_silu_mul_quant[grid]( 2025-05-07T20:32:47.4059971Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:47.4060643Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:47.4061302Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:47.4061822Z kernel = self.compile( 2025-05-07T20:32:47.4062401Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:47.4063055Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:47.4063450Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:47.4063679Z 2025-05-07T20:32:47.4063885Z self = 2025-05-07T20:32:47.4064941Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:47.4066291Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7fe9c5ffa980>} 2025-05-07T20:32:47.4067613Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:47.4068705Z context = 2025-05-07T20:32:47.4068992Z 2025-05-07T20:32:47.4069156Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:47.4069669Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:47.4070126Z module_map=module_map) 2025-05-07T20:32:47.4070482Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:47.4070831Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:32:47.4071091Z E ^ 2025-05-07T20:32:47.4071542Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:47.4071986Z 2025-05-07T20:32:47.4072393Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:48.0958909Z 2025-05-07T20:32:48.0959895Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:48.0960447Z self=, 2025-05-07T20:32:48.0960855Z T=1, 2025-05-07T20:32:48.0961054Z D=7168, 2025-05-07T20:32:48.0961244Z scale_ub=None, 2025-05-07T20:32:48.0961456Z contiguous=True, 2025-05-07T20:32:48.0961681Z compiled=True, 2025-05-07T20:32:48.0961884Z ) 2025-05-07T20:32:48.0962205Z self = 2025-05-07T20:32:48.0962693Z T = 1, D = 7168, scale_ub = None, contiguous = True, compiled = True 2025-05-07T20:32:48.0962954Z 2025-05-07T20:32:48.0963338Z @given( 2025-05-07T20:32:48.0963574Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:48.0963893Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:48.0964194Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:48.0964531Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:48.0964865Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:48.0965154Z ) 2025-05-07T20:32:48.0965496Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:48.0966039Z def test_silu_mul_quant( 2025-05-07T20:32:48.0966278Z self, 2025-05-07T20:32:48.0966467Z T: int, 2025-05-07T20:32:48.0966666Z D: int, 2025-05-07T20:32:48.0966885Z scale_ub: Optional[float], 2025-05-07T20:32:48.0967148Z contiguous: bool, 2025-05-07T20:32:48.0967387Z compiled: bool, 2025-05-07T20:32:48.0967613Z ) -> None: 2025-05-07T20:32:48.0967829Z torch.manual_seed(2025) 2025-05-07T20:32:48.0968075Z 2025-05-07T20:32:48.0968352Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:48.0968692Z 2025-05-07T20:32:48.0968886Z x_sign = torch.sign(x) 2025-05-07T20:32:48.0969296Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:48.0969600Z x = x_sign * x_clamp 2025-05-07T20:32:48.0969848Z x0 = x[:, :D] 2025-05-07T20:32:48.0970068Z x1 = x[:, D:] 2025-05-07T20:32:48.0970270Z 2025-05-07T20:32:48.0970459Z if contiguous: 2025-05-07T20:32:48.0970698Z x0 = x0.contiguous() 2025-05-07T20:32:48.0970956Z x1 = x1.contiguous() 2025-05-07T20:32:48.0971190Z 2025-05-07T20:32:48.0971384Z if scale_ub is not None: 2025-05-07T20:32:48.0971662Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:48.0971997Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:48.0972311Z ) 2025-05-07T20:32:48.0972508Z else: 2025-05-07T20:32:48.0972711Z scale_ub_tensor = None 2025-05-07T20:32:48.0972961Z 2025-05-07T20:32:48.0973191Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:48.0973501Z op = silu_mul_quant 2025-05-07T20:32:48.0973864Z if compiled: 2025-05-07T20:32:48.0974117Z op = torch.compile(op) 2025-05-07T20:32:48.0974495Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:48.0974770Z 2025-05-07T20:32:48.0974962Z y_fp8, y_scale = fn() 2025-05-07T20:32:48.0975244Z y = y_fp8.to(torch.float32) * y_scale[:, None] 2025-05-07T20:32:48.0975533Z 2025-05-07T20:32:48.0975770Z def ref_fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:48.0976121Z x0_fp32 = x0.to(torch.float32) 2025-05-07T20:32:48.0982355Z x1_fp32 = x1.to(torch.float32) 2025-05-07T20:32:48.0982679Z y = x0_fp32 * torch.sigmoid(x0_fp32) * x1_fp32 2025-05-07T20:32:48.0983048Z return triton_quantize_fp8_row(y, scale_ub_tensor) 2025-05-07T20:32:48.0983364Z 2025-05-07T20:32:48.0983571Z > y_fp8_ref, 
y_scale_ref = ref_fn() 2025-05-07T20:32:48.0983768Z 2025-05-07T20:32:48.0983869Z moe/activation_test.py:126: 2025-05-07T20:32:48.0984174Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:48.0984514Z moe/activation_test.py:124: in ref_fn 2025-05-07T20:32:48.0984832Z return triton_quantize_fp8_row(y, scale_ub_tensor) 2025-05-07T20:32:48.0985615Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/fbgemm_gpu/experimental/gemm/triton_gemm/fp8_gemm.py:2370: in triton_quantize_fp8_row 2025-05-07T20:32:48.0986363Z _kernel_quantize_fp8_row[grid]( 2025-05-07T20:32:48.0986908Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:48.0987574Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:48.0988325Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/autotuner.py:186: in run 2025-05-07T20:32:48.0989036Z timings = {config: self._bench(*args, config=config, **kwargs) for config in pruned_configs} 2025-05-07T20:32:48.0989755Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/autotuner.py:166: in _bench 2025-05-07T20:32:48.0990374Z return self.do_bench(kernel_call, quantiles=(0.5, 0.2, 0.8)) 2025-05-07T20:32:48.0990968Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/testing.py:117: in do_bench 2025-05-07T20:32:48.0991525Z fn() 2025-05-07T20:32:48.0992018Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/autotuner.py:152: in kernel_call 2025-05-07T20:32:48.0992591Z self.fn.run( 2025-05-07T20:32:48.0993054Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:48.0993581Z kernel = self.compile( 2025-05-07T20:32:48.0994112Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:48.0994768Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:48.0995206Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:48.0995435Z 2025-05-07T20:32:48.0995639Z self = 2025-05-07T20:32:48.0996708Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=2, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:48.1001471Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7fe9c519e5c0>} 2025-05-07T20:32:48.1002805Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:48.1003813Z context = 2025-05-07T20:32:48.1004105Z 2025-05-07T20:32:48.1004367Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:48.1004877Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:48.1005342Z module_map=module_map) 2025-05-07T20:32:48.1005708Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:48.1006055Z E def _kernel_quantize_fp8_row( 2025-05-07T20:32:48.1006322Z E ^ 2025-05-07T20:32:48.1006785Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:48.1007228Z 2025-05-07T20:32:48.1007647Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:48.1008153Z 2025-05-07T20:32:48.1008255Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:48.1008714Z self=, 2025-05-07T20:32:48.1009116Z T=4096, 2025-05-07T20:32:48.1009301Z D=5120, 2025-05-07T20:32:48.1009500Z scale_ub=None, 2025-05-07T20:32:48.1009714Z contiguous=False, 2025-05-07T20:32:48.1009942Z compiled=False, 2025-05-07T20:32:48.1010140Z ) 2025-05-07T20:32:48.1010453Z self = 2025-05-07T20:32:48.1010939Z T = 4096, D = 5120, scale_ub = None, contiguous = False, compiled = False 2025-05-07T20:32:48.1011203Z 2025-05-07T20:32:48.1011284Z @given( 2025-05-07T20:32:48.1011504Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:48.1011813Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:48.1012191Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:48.1012507Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:48.1012833Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:48.1013116Z ) 2025-05-07T20:32:48.1013457Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:48.1013974Z def test_silu_mul_quant( 2025-05-07T20:32:48.1014211Z self, 2025-05-07T20:32:48.1014395Z T: int, 2025-05-07T20:32:48.1014655Z D: int, 2025-05-07T20:32:48.1014873Z scale_ub: Optional[float], 2025-05-07T20:32:48.1015136Z contiguous: bool, 2025-05-07T20:32:48.1015369Z compiled: bool, 2025-05-07T20:32:48.1015589Z ) -> None: 2025-05-07T20:32:48.1015804Z torch.manual_seed(2025) 2025-05-07T20:32:48.1016031Z 2025-05-07T20:32:48.1016299Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:48.1016637Z 2025-05-07T20:32:48.1016820Z x_sign = torch.sign(x) 2025-05-07T20:32:48.1017104Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:48.1017414Z x = x_sign * x_clamp 2025-05-07T20:32:48.1017646Z x0 = x[:, :D] 2025-05-07T20:32:48.1017926Z x1 = x[:, D:] 2025-05-07T20:32:48.1018132Z 2025-05-07T20:32:48.1018309Z if contiguous: 2025-05-07T20:32:48.1018541Z x0 = x0.contiguous() 2025-05-07T20:32:48.1018794Z x1 = x1.contiguous() 2025-05-07T20:32:48.1019024Z 2025-05-07T20:32:48.1019212Z if scale_ub is not None: 2025-05-07T20:32:48.1019477Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:48.1019798Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:48.1020102Z ) 2025-05-07T20:32:48.1020294Z else: 2025-05-07T20:32:48.1020495Z scale_ub_tensor = None 2025-05-07T20:32:48.1020744Z 2025-05-07T20:32:48.1020975Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:48.1021287Z op = silu_mul_quant 2025-05-07T20:32:48.1021522Z if compiled: 2025-05-07T20:32:48.1021766Z op = torch.compile(op) 2025-05-07T20:32:48.1022060Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:48.1022328Z 2025-05-07T20:32:48.1022520Z > y_fp8, y_scale = fn() 2025-05-07T20:32:48.1022726Z 2025-05-07T20:32:48.1022831Z moe/activation_test.py:117: 2025-05-07T20:32:48.1023114Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:48.1023443Z moe/activation_test.py:115: in fn 2025-05-07T20:32:48.1023722Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:48.1024405Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:32:48.1025081Z 
_fbgemm_silu_mul_quant[grid]( 2025-05-07T20:32:48.1025613Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:48.1026287Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:48.1026931Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:48.1027455Z kernel = self.compile( 2025-05-07T20:32:48.1027990Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:48.1028634Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:48.1029037Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:48.1029268Z 2025-05-07T20:32:48.1029470Z self = 2025-05-07T20:32:48.1030528Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:48.1031925Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7fe9c51be3e0>} 2025-05-07T20:32:48.1033245Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:48.1034279Z context = 2025-05-07T20:32:48.1034566Z 2025-05-07T20:32:48.1034728Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:48.1035235Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:48.1035696Z module_map=module_map) 2025-05-07T20:32:48.1036054Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:48.1036397Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:32:48.1036653Z E ^ 2025-05-07T20:32:48.1037099Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:48.1037586Z 2025-05-07T20:32:48.1037993Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:48.7935789Z 2025-05-07T20:32:48.7936145Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:48.7936585Z self=, 2025-05-07T20:32:48.7937030Z T=4096, 2025-05-07T20:32:48.7937226Z D=7168, 2025-05-07T20:32:48.7937433Z scale_ub=None, 2025-05-07T20:32:48.7937659Z contiguous=False, 2025-05-07T20:32:48.7937895Z compiled=False, 2025-05-07T20:32:48.7938104Z ) 2025-05-07T20:32:48.7938425Z self = 2025-05-07T20:32:48.7938927Z T = 4096, D = 7168, scale_ub = None, contiguous = False, compiled = False 2025-05-07T20:32:48.7939198Z 2025-05-07T20:32:48.7939278Z @given( 2025-05-07T20:32:48.7939513Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:48.7939831Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:48.7940251Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:48.7940587Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:48.7940921Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:48.7941204Z ) 2025-05-07T20:32:48.7941551Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:48.7941996Z def test_silu_mul_quant( 2025-05-07T20:32:48.7942239Z self, 2025-05-07T20:32:48.7942431Z T: int, 2025-05-07T20:32:48.7942632Z D: int, 2025-05-07T20:32:48.7942855Z scale_ub: Optional[float], 2025-05-07T20:32:48.7943128Z contiguous: bool, 2025-05-07T20:32:48.7943370Z compiled: bool, 2025-05-07T20:32:48.7943601Z ) -> None: 2025-05-07T20:32:48.7943815Z torch.manual_seed(2025) 2025-05-07T20:32:48.7944060Z 2025-05-07T20:32:48.7944336Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:48.7944677Z 2025-05-07T20:32:48.7944878Z x_sign = torch.sign(x) 2025-05-07T20:32:48.7945171Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:48.7945476Z x = x_sign * x_clamp 2025-05-07T20:32:48.7945721Z x0 = x[:, :D] 2025-05-07T20:32:48.7945945Z x1 = x[:, D:] 2025-05-07T20:32:48.7946151Z 2025-05-07T20:32:48.7946343Z if contiguous: 2025-05-07T20:32:48.7946580Z x0 = x0.contiguous() 2025-05-07T20:32:48.7946842Z x1 = x1.contiguous() 2025-05-07T20:32:48.7947079Z 2025-05-07T20:32:48.7947278Z if scale_ub is not None: 2025-05-07T20:32:48.7947558Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:48.7947997Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:48.7948312Z ) 2025-05-07T20:32:48.7948533Z else: 2025-05-07T20:32:48.7948782Z scale_ub_tensor = None 2025-05-07T20:32:48.7949044Z 2025-05-07T20:32:48.7949286Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:48.7949599Z op = silu_mul_quant 2025-05-07T20:32:48.7949858Z if compiled: 2025-05-07T20:32:48.7950115Z op = torch.compile(op) 2025-05-07T20:32:48.7950477Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:48.7950758Z 2025-05-07T20:32:48.7950960Z > y_fp8, y_scale = fn() 2025-05-07T20:32:48.7951124Z 2025-05-07T20:32:48.7951225Z moe/activation_test.py:117: 2025-05-07T20:32:48.7951526Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:48.7951862Z moe/activation_test.py:115: in fn 2025-05-07T20:32:48.7952151Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:48.7952832Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:32:48.7953516Z 
_fbgemm_silu_mul_quant[grid]( 2025-05-07T20:32:48.7954117Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:48.7954788Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:48.7955459Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:48.7955997Z kernel = self.compile( 2025-05-07T20:32:48.7956539Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:48.7957194Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:48.7957588Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:48.7957824Z 2025-05-07T20:32:48.7958032Z self = 2025-05-07T20:32:48.7959206Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:48.7960568Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7fe9c51bede0>} 2025-05-07T20:32:48.7961895Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:48.7962897Z context = 2025-05-07T20:32:48.7963191Z 2025-05-07T20:32:48.7963355Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:48.7963871Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:48.7964330Z module_map=module_map) 2025-05-07T20:32:48.7964706Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:48.7965062Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:32:48.7965324Z E ^ 2025-05-07T20:32:48.7965780Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:48.7966231Z 2025-05-07T20:32:48.7966639Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:48.7967143Z 2025-05-07T20:32:48.7967255Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:48.7967666Z self=, 2025-05-07T20:32:48.7968106Z T=128, 2025-05-07T20:32:48.7968291Z D=7168, 2025-05-07T20:32:48.7968490Z scale_ub=None, 2025-05-07T20:32:48.7968701Z contiguous=False, 2025-05-07T20:32:48.7968923Z compiled=True, 2025-05-07T20:32:48.7969126Z ) 2025-05-07T20:32:48.7969440Z self = 2025-05-07T20:32:48.7969930Z T = 128, D = 7168, scale_ub = None, contiguous = False, compiled = True 2025-05-07T20:32:48.7970194Z 2025-05-07T20:32:48.7970274Z @given( 2025-05-07T20:32:48.7970552Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:48.7970868Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:48.7971176Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:48.7971509Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:48.7971832Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:48.7972118Z ) 2025-05-07T20:32:48.7972467Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:48.7972902Z def test_silu_mul_quant( 2025-05-07T20:32:48.7973142Z self, 2025-05-07T20:32:48.7973365Z T: int, 2025-05-07T20:32:48.7973565Z D: int, 2025-05-07T20:32:48.7973925Z scale_ub: Optional[float], 2025-05-07T20:32:48.7974200Z contiguous: bool, 2025-05-07T20:32:48.7974441Z compiled: bool, 2025-05-07T20:32:48.7974660Z ) -> None: 2025-05-07T20:32:48.7974873Z torch.manual_seed(2025) 2025-05-07T20:32:48.7975115Z 2025-05-07T20:32:48.7975380Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:48.7975724Z 2025-05-07T20:32:48.7975918Z x_sign = torch.sign(x) 2025-05-07T20:32:48.7976205Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:48.7976514Z x = x_sign * x_clamp 2025-05-07T20:32:48.7976753Z x0 = x[:, :D] 2025-05-07T20:32:48.7976973Z x1 = x[:, D:] 2025-05-07T20:32:48.7977182Z 2025-05-07T20:32:48.7977370Z if contiguous: 2025-05-07T20:32:48.7977602Z x0 = x0.contiguous() 2025-05-07T20:32:48.7977853Z x1 = x1.contiguous() 2025-05-07T20:32:48.7978094Z 2025-05-07T20:32:48.7978288Z if scale_ub is not None: 2025-05-07T20:32:48.7978558Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:48.7978988Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:48.7979307Z ) 2025-05-07T20:32:48.7979500Z else: 2025-05-07T20:32:48.7979718Z scale_ub_tensor = None 2025-05-07T20:32:48.7979970Z 2025-05-07T20:32:48.7980195Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:48.7980513Z op = silu_mul_quant 2025-05-07T20:32:48.7980764Z if compiled: 2025-05-07T20:32:48.7981008Z op = torch.compile(op) 2025-05-07T20:32:48.7981304Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:48.7981580Z 2025-05-07T20:32:48.7981772Z y_fp8, y_scale = fn() 2025-05-07T20:32:48.7982050Z y = y_fp8.to(torch.float32) * y_scale[:, None] 2025-05-07T20:32:48.7982344Z 2025-05-07T20:32:48.7982577Z def ref_fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:48.7982909Z x0_fp32 = x0.to(torch.float32) 2025-05-07T20:32:48.7983207Z x1_fp32 = x1.to(torch.float32) 2025-05-07T20:32:48.7983518Z y = x0_fp32 * torch.sigmoid(x0_fp32) * x1_fp32 2025-05-07T20:32:48.7983869Z return triton_quantize_fp8_row(y, scale_ub_tensor) 2025-05-07T20:32:48.7984180Z 2025-05-07T20:32:48.7984383Z > y_fp8_ref, 
y_scale_ref = ref_fn() 2025-05-07T20:32:48.7984574Z 2025-05-07T20:32:48.7984673Z moe/activation_test.py:126: 2025-05-07T20:32:48.7984970Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:48.7985309Z moe/activation_test.py:124: in ref_fn 2025-05-07T20:32:48.7985635Z return triton_quantize_fp8_row(y, scale_ub_tensor) 2025-05-07T20:32:48.7986454Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/fbgemm_gpu/experimental/gemm/triton_gemm/fp8_gemm.py:2370: in triton_quantize_fp8_row 2025-05-07T20:32:48.7987194Z _kernel_quantize_fp8_row[grid]( 2025-05-07T20:32:48.7987743Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:48.7988411Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:48.7989130Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/autotuner.py:186: in run 2025-05-07T20:32:48.7989846Z timings = {config: self._bench(*args, config=config, **kwargs) for config in pruned_configs} 2025-05-07T20:32:48.7990561Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/autotuner.py:166: in _bench 2025-05-07T20:32:48.7991193Z return self.do_bench(kernel_call, quantiles=(0.5, 0.2, 0.8)) 2025-05-07T20:32:48.7991791Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/testing.py:117: in do_bench 2025-05-07T20:32:48.7992311Z fn() 2025-05-07T20:32:48.7992862Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/autotuner.py:152: in kernel_call 2025-05-07T20:32:48.7993435Z self.fn.run( 2025-05-07T20:32:48.7993902Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:48.7994427Z kernel = self.compile( 2025-05-07T20:32:48.7994959Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:48.7995605Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:48.7996002Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:48.7996229Z 2025-05-07T20:32:48.7996441Z self = 2025-05-07T20:32:48.7997506Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=2, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:48.7999122Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7fe9c4683a60>} 2025-05-07T20:32:48.8000447Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:48.8001460Z context = 2025-05-07T20:32:48.8001745Z 2025-05-07T20:32:48.8001917Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:48.8002424Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:48.8002889Z module_map=module_map) 2025-05-07T20:32:48.8003254Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:48.8003603Z E def _kernel_quantize_fp8_row( 2025-05-07T20:32:48.8003878Z E ^ 2025-05-07T20:32:48.8004340Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:48.8004778Z 2025-05-07T20:32:48.8005195Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:49.0371011Z 2025-05-07T20:32:49.0371298Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:49.0371734Z self=, 2025-05-07T20:32:49.0372163Z T=128, 2025-05-07T20:32:49.0372355Z D=7168, 2025-05-07T20:32:49.0372560Z scale_ub=None, 2025-05-07T20:32:49.0372929Z contiguous=False, 2025-05-07T20:32:49.0373163Z compiled=False, 2025-05-07T20:32:49.0373373Z ) 2025-05-07T20:32:49.0373780Z self = 2025-05-07T20:32:49.0374267Z T = 128, D = 7168, scale_ub = None, contiguous = False, compiled = False 2025-05-07T20:32:49.0374540Z 2025-05-07T20:32:49.0374624Z @given( 2025-05-07T20:32:49.0374862Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:49.0375175Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:49.0375551Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:49.0375884Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:49.0376212Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:49.0376494Z ) 2025-05-07T20:32:49.0376841Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:49.0377284Z def test_silu_mul_quant( 2025-05-07T20:32:49.0377524Z self, 2025-05-07T20:32:49.0377721Z T: int, 2025-05-07T20:32:49.0377926Z D: int, 2025-05-07T20:32:49.0378140Z scale_ub: Optional[float], 2025-05-07T20:32:49.0378415Z contiguous: bool, 2025-05-07T20:32:49.0378654Z compiled: bool, 2025-05-07T20:32:49.0378941Z ) -> None: 2025-05-07T20:32:49.0379162Z torch.manual_seed(2025) 2025-05-07T20:32:49.0379405Z 2025-05-07T20:32:49.0379677Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:49.0380013Z 2025-05-07T20:32:49.0380206Z x_sign = torch.sign(x) 2025-05-07T20:32:49.0380501Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:49.0380803Z x = x_sign * x_clamp 2025-05-07T20:32:49.0381043Z x0 = x[:, :D] 2025-05-07T20:32:49.0381256Z x1 = x[:, D:] 2025-05-07T20:32:49.0381462Z 2025-05-07T20:32:49.0381651Z if contiguous: 2025-05-07T20:32:49.0381884Z x0 = x0.contiguous() 2025-05-07T20:32:49.0382141Z x1 = x1.contiguous() 2025-05-07T20:32:49.0382382Z 2025-05-07T20:32:49.0382578Z if scale_ub is not None: 2025-05-07T20:32:49.0382846Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:49.0383186Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:49.0383501Z ) 2025-05-07T20:32:49.0383688Z else: 2025-05-07T20:32:49.0384043Z scale_ub_tensor = None 2025-05-07T20:32:49.0384295Z 2025-05-07T20:32:49.0384527Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:49.0384847Z op = silu_mul_quant 2025-05-07T20:32:49.0385094Z if compiled: 2025-05-07T20:32:49.0385344Z op = torch.compile(op) 2025-05-07T20:32:49.0385641Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:49.0385914Z 2025-05-07T20:32:49.0386107Z > y_fp8, y_scale = fn() 2025-05-07T20:32:49.0386273Z 2025-05-07T20:32:49.0386372Z moe/activation_test.py:117: 2025-05-07T20:32:49.0386672Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:49.0386998Z moe/activation_test.py:115: in fn 2025-05-07T20:32:49.0387286Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:49.0387982Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:32:49.0388660Z 
_fbgemm_silu_mul_quant[grid]( 2025-05-07T20:32:49.0389204Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:49.0389882Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:49.0390542Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:49.0392415Z kernel = self.compile( 2025-05-07T20:32:49.0392955Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:49.0394127Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:49.0394518Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:49.0394755Z 2025-05-07T20:32:49.0400933Z self = 2025-05-07T20:32:49.0402026Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:49.0403504Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7fe9c4105e40>} 2025-05-07T20:32:49.0404833Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:49.0405853Z context = 2025-05-07T20:32:49.0406149Z 2025-05-07T20:32:49.0406317Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:49.0406907Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:49.0407371Z module_map=module_map) 2025-05-07T20:32:49.0407745Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:49.0408112Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:32:49.0408380Z E ^ 2025-05-07T20:32:49.0408845Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:49.0409297Z 2025-05-07T20:32:49.0409709Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:49.0410220Z 2025-05-07T20:32:49.0410332Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:49.0410738Z self=, 2025-05-07T20:32:49.0411143Z T=4096, 2025-05-07T20:32:49.0411339Z D=5120, 2025-05-07T20:32:49.0411540Z scale_ub=1200.0, 2025-05-07T20:32:49.0411764Z contiguous=True, 2025-05-07T20:32:49.0412051Z compiled=False, 2025-05-07T20:32:49.0412264Z ) 2025-05-07T20:32:49.0412584Z self = 2025-05-07T20:32:49.0413084Z T = 4096, D = 5120, scale_ub = 1200.0, contiguous = True, compiled = False 2025-05-07T20:32:49.0413357Z 2025-05-07T20:32:49.0413447Z @given( 2025-05-07T20:32:49.0413747Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:49.0414100Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:49.0414448Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:49.0414816Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:49.0415151Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:49.0415447Z ) 2025-05-07T20:32:49.0415802Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:49.0416245Z def test_silu_mul_quant( 2025-05-07T20:32:49.0416493Z self, 2025-05-07T20:32:49.0416696Z T: int, 2025-05-07T20:32:49.0416892Z D: int, 2025-05-07T20:32:49.0417115Z scale_ub: Optional[float], 2025-05-07T20:32:49.0417396Z contiguous: bool, 2025-05-07T20:32:49.0417634Z compiled: bool, 2025-05-07T20:32:49.0417866Z ) -> None: 2025-05-07T20:32:49.0418088Z torch.manual_seed(2025) 2025-05-07T20:32:49.0418327Z 2025-05-07T20:32:49.0418602Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:49.0418955Z 2025-05-07T20:32:49.0419150Z x_sign = torch.sign(x) 2025-05-07T20:32:49.0419449Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:49.0419838Z x = x_sign * x_clamp 2025-05-07T20:32:49.0420079Z x0 = x[:, :D] 2025-05-07T20:32:49.0420297Z x1 = x[:, D:] 2025-05-07T20:32:49.0420507Z 2025-05-07T20:32:49.0420696Z if contiguous: 2025-05-07T20:32:49.0420925Z x0 = x0.contiguous() 2025-05-07T20:32:49.0421189Z x1 = x1.contiguous() 2025-05-07T20:32:49.0421438Z 2025-05-07T20:32:49.0421632Z if scale_ub is not None: 2025-05-07T20:32:49.0421911Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:49.0422309Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:49.0422619Z ) 2025-05-07T20:32:49.0422811Z else: 2025-05-07T20:32:49.0423023Z scale_ub_tensor = None 2025-05-07T20:32:49.0423279Z 2025-05-07T20:32:49.0423504Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:49.0423820Z op = silu_mul_quant 2025-05-07T20:32:49.0424072Z if compiled: 2025-05-07T20:32:49.0424322Z op = torch.compile(op) 2025-05-07T20:32:49.0424621Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:49.0424898Z 2025-05-07T20:32:49.0425086Z > y_fp8, y_scale = fn() 2025-05-07T20:32:49.0425252Z 2025-05-07T20:32:49.0425401Z moe/activation_test.py:117: 2025-05-07T20:32:49.0425703Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:49.0426036Z moe/activation_test.py:115: in fn 2025-05-07T20:32:49.0426316Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:49.0427007Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:32:49.0427698Z 
_fbgemm_silu_mul_quant[grid]( 2025-05-07T20:32:49.0428232Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:49.0428909Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:49.0429575Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:49.0430108Z kernel = self.compile( 2025-05-07T20:32:49.0430651Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:49.0431347Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:49.0431748Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:49.0431975Z 2025-05-07T20:32:49.0432183Z self = 2025-05-07T20:32:49.0433255Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:49.0434617Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7fe9c41068e0>} 2025-05-07T20:32:49.0435948Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:49.0436961Z context = 2025-05-07T20:32:49.0437247Z 2025-05-07T20:32:49.0437415Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:49.0437934Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:49.0438400Z module_map=module_map) 2025-05-07T20:32:49.0438770Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:49.0439124Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:32:49.0439463Z E ^ 2025-05-07T20:32:49.0439947Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:49.0440393Z 2025-05-07T20:32:49.0440805Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:49.0441319Z 2025-05-07T20:32:49.0441432Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:49.0441845Z self=, 2025-05-07T20:32:49.0442291Z T=1, 2025-05-07T20:32:49.0442472Z D=5120, 2025-05-07T20:32:49.0442669Z scale_ub=None, 2025-05-07T20:32:49.0442887Z contiguous=True, 2025-05-07T20:32:49.0443107Z compiled=True, 2025-05-07T20:32:49.0443311Z ) 2025-05-07T20:32:49.0443633Z self = 2025-05-07T20:32:49.0444105Z T = 1, D = 5120, scale_ub = None, contiguous = True, compiled = True 2025-05-07T20:32:49.0444367Z 2025-05-07T20:32:49.0444445Z @given( 2025-05-07T20:32:49.0444672Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:49.0444980Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:49.0445284Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:49.0445653Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:49.0445983Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:49.0446264Z ) 2025-05-07T20:32:49.0446610Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:49.0447052Z def test_silu_mul_quant( 2025-05-07T20:32:49.0447287Z self, 2025-05-07T20:32:49.0447479Z T: int, 2025-05-07T20:32:49.0447674Z D: int, 2025-05-07T20:32:49.0447884Z scale_ub: Optional[float], 2025-05-07T20:32:49.0448152Z contiguous: bool, 2025-05-07T20:32:49.0448391Z compiled: bool, 2025-05-07T20:32:49.0448605Z ) -> None: 2025-05-07T20:32:49.0448823Z torch.manual_seed(2025) 2025-05-07T20:32:49.0449060Z 2025-05-07T20:32:49.0449345Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:49.0449706Z 2025-05-07T20:32:49.0449899Z x_sign = torch.sign(x) 2025-05-07T20:32:49.0450192Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:49.0450541Z x = x_sign * x_clamp 2025-05-07T20:32:49.0450782Z x0 = x[:, :D] 2025-05-07T20:32:49.0451001Z x1 = x[:, D:] 2025-05-07T20:32:49.0451199Z 2025-05-07T20:32:49.0451381Z if contiguous: 2025-05-07T20:32:49.0451608Z x0 = x0.contiguous() 2025-05-07T20:32:49.0451856Z x1 = x1.contiguous() 2025-05-07T20:32:49.0452098Z 2025-05-07T20:32:49.0452289Z if scale_ub is not None: 2025-05-07T20:32:49.0452556Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:49.0452883Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:49.0453187Z ) 2025-05-07T20:32:49.0453371Z else: 2025-05-07T20:32:49.0453577Z scale_ub_tensor = None 2025-05-07T20:32:49.0453896Z 2025-05-07T20:32:49.0454117Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:49.0454429Z op = silu_mul_quant 2025-05-07T20:32:49.0454680Z if compiled: 2025-05-07T20:32:49.0454924Z op = torch.compile(op) 2025-05-07T20:32:49.0455214Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:49.0455486Z 2025-05-07T20:32:49.0455677Z y_fp8, y_scale = fn() 2025-05-07T20:32:49.0455950Z y = y_fp8.to(torch.float32) * y_scale[:, None] 2025-05-07T20:32:49.0456239Z 2025-05-07T20:32:49.0456471Z def ref_fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:49.0456798Z x0_fp32 = x0.to(torch.float32) 2025-05-07T20:32:49.0457084Z x1_fp32 = x1.to(torch.float32) 2025-05-07T20:32:49.0457394Z y = x0_fp32 * torch.sigmoid(x0_fp32) * x1_fp32 2025-05-07T20:32:49.0457791Z return triton_quantize_fp8_row(y, scale_ub_tensor) 2025-05-07T20:32:49.0458095Z 2025-05-07T20:32:49.0458295Z > y_fp8_ref, 
y_scale_ref = ref_fn() 2025-05-07T20:32:49.0458484Z 2025-05-07T20:32:49.0458586Z moe/activation_test.py:126: 2025-05-07T20:32:49.0458875Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:49.0459204Z moe/activation_test.py:124: in ref_fn 2025-05-07T20:32:49.0459524Z return triton_quantize_fp8_row(y, scale_ub_tensor) 2025-05-07T20:32:49.0460335Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/fbgemm_gpu/experimental/gemm/triton_gemm/fp8_gemm.py:2370: in triton_quantize_fp8_row 2025-05-07T20:32:49.0461072Z _kernel_quantize_fp8_row[grid]( 2025-05-07T20:32:49.0461610Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:49.0462278Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:49.0462953Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/autotuner.py:186: in run 2025-05-07T20:32:49.0463663Z timings = {config: self._bench(*args, config=config, **kwargs) for config in pruned_configs} 2025-05-07T20:32:49.0464444Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/autotuner.py:166: in _bench 2025-05-07T20:32:49.0465067Z return self.do_bench(kernel_call, quantiles=(0.5, 0.2, 0.8)) 2025-05-07T20:32:49.0465664Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/testing.py:117: in do_bench 2025-05-07T20:32:49.0466178Z fn() 2025-05-07T20:32:49.0466675Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/autotuner.py:152: in kernel_call 2025-05-07T20:32:49.0467241Z self.fn.run( 2025-05-07T20:32:49.0467701Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:49.0468223Z kernel = self.compile( 2025-05-07T20:32:49.0468749Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:49.0469399Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:49.0469882Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:49.0470108Z 2025-05-07T20:32:49.0470319Z self = 2025-05-07T20:32:49.0471385Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=2, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:49.0472730Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7fe9c4107560>} 2025-05-07T20:32:49.0474057Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:49.0475064Z context = 2025-05-07T20:32:49.0475345Z 2025-05-07T20:32:49.0475515Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:49.0476020Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:49.0476483Z module_map=module_map) 2025-05-07T20:32:49.0476847Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:49.0477193Z E def _kernel_quantize_fp8_row( 2025-05-07T20:32:49.0477458Z E ^ 2025-05-07T20:32:49.0477912Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')")

/home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:100: CompilationError

Trying example: test_silu_mul_quant(T=2048, D=5120, scale_ub=None, contiguous=True, compiled=True)
[same test body and ref_fn -> triton_quantize_fp8_row -> _kernel_quantize_fp8_row traceback as the T=1 example above]
E   triton.compiler.errors.CompilationError: at 1:0: def _kernel_quantize_fp8_row( -> ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')")

Trying example: test_silu_mul_quant(T=128, D=5120, scale_ub=None, contiguous=True, compiled=True)
[same traceback and CompilationError]

Trying example: test_silu_mul_quant(T=4096, D=5120, scale_ub=None, contiguous=True, compiled=True)
[same traceback and CompilationError]

Trying example: test_silu_mul_quant(T=16384, D=5120, scale_ub=None, contiguous=True, compiled=True)
[same traceback and CompilationError]
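Every failure above is the same root cause surfaced through Hypothesis's example sweep: Triton refuses to lower the fp8e4nv (FP8 E4M3) dtype because the runner's GPU predates compute capability 8.9; pre-Ada parts only expose fp8e4b15 and fp8e5, exactly as the ValueError reports. A minimal skip-guard sketch for such environments follows; `supports_fp8e4nv` and `requires_fp8e4nv` are hypothetical names, not part of the test file:

    import pytest
    import torch

    def supports_fp8e4nv() -> bool:
        # fp8e4nv (FP8 E4M3) needs an SM89+ NVIDIA GPU (Ada/Hopper); older
        # architectures only expose fp8e4b15 and fp8e5 in Triton, which is
        # exactly what the ValueError above reports.
        if not torch.cuda.is_available():
            return False
        return torch.cuda.get_device_capability() >= (8, 9)

    requires_fp8e4nv = pytest.mark.skipif(
        not supports_fp8e4nv(),
        reason="Triton fp8e4nv requires compute capability >= 8.9",
    )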
2025-05-07T20:32:50.5343478Z W0507 20:32:50.533000 276945 site-packages/torch/_dynamo/convert_frame.py:987] [0/8] torch._dynamo hit config.recompile_limit (8)
2025-05-07T20:32:50.5344697Z W0507 20:32:50.533000 276945 site-packages/torch/_dynamo/convert_frame.py:987] [0/8] function: 'silu_mul_quant' (/home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:55)
2025-05-07T20:32:50.5346105Z W0507 20:32:50.533000 276945 site-packages/torch/_dynamo/convert_frame.py:987] [0/8] last reason: 0/7: tensor 'x0' stride mismatch at index 0. expected 5120, actual 10240
2025-05-07T20:32:50.5347089Z W0507 20:32:50.533000 276945 site-packages/torch/_dynamo/convert_frame.py:987] [0/8] To log all recompilation reasons, use TORCH_LOGS="recompiles".
2025-05-07T20:32:50.5348177Z W0507 20:32:50.533000 276945 site-packages/torch/_dynamo/convert_frame.py:987] [0/8] To diagnose recompilation issues, see https://pytorch.org/docs/main/torch.compiler_troubleshooting.html.

Trying example: test_silu_mul_quant(T=1, D=5120, scale_ub=1200.0, contiguous=True, compiled=True)
[same test body as above; this example fails in the forward call rather than the reference path]
>       y_fp8, y_scale = fn()

moe/activation_test.py:117:
moe/activation_test.py:115: in fn
    return op(x0, x1, scale_ub_tensor)
/home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/torch/_dynamo/eval_frame.py:678: in _fn
    return fn(*args, **kwargs)
/home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant
    _fbgemm_silu_mul_quant[grid](
/home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/jit.py:330: in <lambda>
    return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs)
/home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/jit.py:623: in run
    kernel = self.compile(
/home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:273: in compile
    module = src.make_ir(options, codegen_fns, module_map, context)
E   triton.compiler.errors.CompilationError: at 1:0:
E   def _fbgemm_silu_mul_quant(
E   ^
E   ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')")

/home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:100: CompilationError
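The recompile-limit warning above is a side effect of the same sweep: each sampled shape/stride combination adds a new dynamo guard for silu_mul_quant until the default limit of 8 is hit. If the recompiles themselves needed fixing, two standard knobs apply; a sketch, assuming this build's torch._dynamo API (the warning itself names config.recompile_limit), with x0/x1 standing in for the test's input tensors:

    import torch
    import torch._dynamo

    # Allow more guard specializations before torch.compile gives up:
    torch._dynamo.config.recompile_limit = 64

    # Or treat the token dimension as dynamic so the sampled T values
    # (1, 128, 2048, 4096, 16384) share one compiled graph; the stride
    # guard from the contiguous=False variants can still force one more.
    torch._dynamo.mark_dynamic(x0, 0)
    torch._dynamo.mark_dynamic(x1, 0)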
Trying example: test_silu_mul_quant(T=1, D=5120, scale_ub=None, contiguous=False, compiled=True)
[same test body as above; fn() returns and the reference path fails instead]
>       y_fp8_ref, y_scale_ref = ref_fn()

moe/activation_test.py:126:
moe/activation_test.py:124: in ref_fn
    return triton_quantize_fp8_row(y, scale_ub_tensor)
[same triton_quantize_fp8_row -> _kernel_quantize_fp8_row traceback as above]
E   triton.compiler.errors.CompilationError: at 1:0:
E   def _kernel_quantize_fp8_row(
E   ^
E   ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')")

/home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:100: CompilationError
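Note that the reference path fails before any comparison happens: triton_quantize_fp8_row always JIT-compiles _kernel_quantize_fp8_row, which materializes an fp8e4nv output. A standalone repro sketch under the same assumption (an fbgemm_gpu GenAI build running on a pre-SM89 GPU):

    import torch
    from fbgemm_gpu.experimental.gemm.triton_gemm.fp8_gemm import triton_quantize_fp8_row

    y = torch.randn(128, 5120, device="cuda", dtype=torch.float32)
    # On a pre-SM89 GPU this raises the CompilationError seen above while
    # compiling _kernel_quantize_fp8_row; on SM89+ it returns the fp8 rows
    # and per-row scales.
    y_fp8, y_scale = triton_quantize_fp8_row(y)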
Trying example: test_silu_mul_quant(T=1, D=5120, scale_ub=None, contiguous=True, compiled=False)
[same test body as above; the eager path fails in the forward call]
>       y_fp8, y_scale = fn()

moe/activation_test.py:117:
moe/activation_test.py:115: in fn
    return op(x0, x1, scale_ub_tensor)
/home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant
    _fbgemm_silu_mul_quant[grid](
[same Triton jit/compile traceback as above]
E   triton.compiler.errors.CompilationError: at 1:0:
E   def _fbgemm_silu_mul_quant(
E   ^
E   ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')")

/home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:100: CompilationError
Trying example: test_silu_mul_quant(T=128, D=5120, scale_ub=None, contiguous=False, compiled=True)
[same test body and _fbgemm_silu_mul_quant traceback as above, via torch/_dynamo/eval_frame.py]
E   triton.compiler.errors.CompilationError: at 1:0: def _fbgemm_silu_mul_quant( -> ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')")

Trying example: test_silu_mul_quant(T=128, D=7168, scale_ub=1200.0, contiguous=False, compiled=False)
[same test body and _fbgemm_silu_mul_quant traceback as above]
E   triton.compiler.errors.CompilationError: at 1:0: def _fbgemm_silu_mul_quant( -> ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')")

/home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:100: CompilationError
Trying example: test_silu_mul_quant(T=128, D=5120, scale_ub=None, contiguous=False, compiled=False)
Trying example: test_silu_mul_quant(T=128, D=5120, scale_ub=1200.0, contiguous=True, compiled=False)
Trying example: test_silu_mul_quant(T=1, D=7168, scale_ub=1200.0, contiguous=True, compiled=True)
Trying example: test_silu_mul_quant(T=1, D=7168, scale_ub=1200.0, contiguous=False, compiled=True)

Each of these examples fails in fn() with the identical CompilationError from _fbgemm_silu_mul_quant (ValueError: type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')). The compiled=True cases additionally pass through /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/torch/_dynamo/eval_frame.py:678 (return fn(*args, **kwargs)) before reaching the same Triton compile step.
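The eval_frame.py frame in the compiled=True cases is expected: torch.compile only wraps the Python call, while the Triton kernel inside silu_mul_quant is still JIT-compiled for the local GPU on first use, so both paths end at the same device-specific compile step. A rough sketch of that dispatch with illustrative names (my_op stands in for silu_mul_quant and is not the FBGEMM implementation):

    # Hedged sketch of the eager-vs-compiled dispatch exercised by the test.
    import torch


    def my_op(x0: torch.Tensor, x1: torch.Tensor) -> torch.Tensor:
        # In FBGEMM this launches a Triton kernel; the kernel is compiled for
        # the *local* GPU the first time it runs, which is where the
        # fp8e4nv ValueError fires.
        return x0 * torch.sigmoid(x0) * x1


    def run(x0: torch.Tensor, x1: torch.Tensor, compiled: bool) -> torch.Tensor:
        op = torch.compile(my_op) if compiled else my_op
        # Either way, the device-specific kernel compile happens at this call.
        return op(x0, x1)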
Trying example: test_silu_mul_quant(T=1, D=7168, scale_ub=None, contiguous=False, compiled=True)

This example gets further: fn() succeeds, and the failure moves into the reference path instead:

        y_fp8, y_scale = fn()
        y = y_fp8.to(torch.float32) * y_scale[:, None]

        def ref_fn() -> Tuple[torch.Tensor, torch.Tensor]:
            x0_fp32 = x0.to(torch.float32)
            x1_fp32 = x1.to(torch.float32)
            y = x0_fp32 * torch.sigmoid(x0_fp32) * x1_fp32
            return triton_quantize_fp8_row(y, scale_ub_tensor)

>       y_fp8_ref, y_scale_ref = ref_fn()

moe/activation_test.py:126:
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _
moe/activation_test.py:124: in ref_fn
    return triton_quantize_fp8_row(y, scale_ub_tensor)
/home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/fbgemm_gpu/experimental/gemm/triton_gemm/fp8_gemm.py:2370: in triton_quantize_fp8_row
    _kernel_quantize_fp8_row[grid](
/home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/autotuner.py:186: in run
    timings = {config: self._bench(*args, config=config, **kwargs) for config in pruned_configs}
/home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/autotuner.py:166: in _bench
    return self.do_bench(kernel_call, quantiles=(0.5, 0.2, 0.8))
/home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/testing.py:117: in do_bench
    fn()
/home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/autotuner.py:152: in kernel_call
    self.fn.run(
/home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/jit.py:623: in run
    kernel = self.compile(
/home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:273: in compile
    module = src.make_ir(options, codegen_fns, module_map, context)
E       triton.compiler.errors.CompilationError: at 1:0:
E       def _kernel_quantize_fp8_row(
E       ^
E       ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')")

/home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:100: CompilationError
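The reference path makes the row-wise quantization contract explicit: y_scale is a per-row dequantization factor, so y is recovered as y_fp8.to(torch.float32) * y_scale[:, None]. A plain-PyTorch approximation of what triton_quantize_fp8_row computes, inferred from the test's usage rather than from FBGEMM's kernel; the eps handling and scale_ub semantics are assumptions:

    # Hedged pure-PyTorch sketch of row-wise fp8 quantization matching the
    # test's dequant convention. Not the FBGEMM kernel.
    from typing import Optional, Tuple

    import torch


    def quantize_fp8_row_ref(
        y: torch.Tensor, scale_ub: Optional[torch.Tensor] = None
    ) -> Tuple[torch.Tensor, torch.Tensor]:
        fp8_max = torch.finfo(torch.float8_e4m3fn).max
        row_max = y.abs().amax(dim=-1).float()          # per-row absolute max
        if scale_ub is not None:
            row_max = torch.minimum(row_max, scale_ub)  # cap outlier rows
        scale = row_max.clamp(min=1e-12) / fp8_max      # dequant factor per row
        y_fp8 = (y / scale[:, None]).to(torch.float8_e4m3fn)
        return y_fp8, scale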
Trying example: test_silu_mul_quant(T=1, D=5120, scale_ub=1200.0, contiguous=False, compiled=True)
Trying example: test_silu_mul_quant(T=1, D=5120, scale_ub=1200.0, contiguous=False, compiled=False)
Trying example: test_silu_mul_quant(T=16384, D=5120, scale_ub=1200.0, contiguous=False, compiled=True)
Trying example: test_silu_mul_quant(T=2048, D=7168, scale_ub=1200.0, contiguous=False, compiled=True)
Trying example: test_silu_mul_quant(T=1, D=5120, scale_ub=None, contiguous=False, compiled=False)

Each of these again fails in fn() with the identical CompilationError from _fbgemm_silu_mul_quant; the compiled=True cases again route through torch/_dynamo/eval_frame.py first.
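For completeness, the architecture check can be reproduced outside FBGEMM: any Triton kernel that materializes the fp8e4nv type should fail the same way on a pre-SM89 GPU. The following is an assumed minimal repro, not taken from this log, using type names from recent Triton releases:

    # Hedged minimal repro (assumption): casting to fp8e4nv inside a Triton
    # kernel raises the same ValueError at compile time on sm < 89.
    import torch
    import triton
    import triton.language as tl


    @triton.jit
    def _cast_to_fp8e4nv(x_ptr, y_ptr):
        x = tl.load(x_ptr)
        tl.store(y_ptr, x.to(tl.float8e4nv))  # compile-time failure on old GPUs


    x = torch.ones(1, device="cuda", dtype=torch.float32)
    y = torch.empty(1, device="cuda", dtype=torch.float8_e4m3fn)
    _cast_to_fp8e4nv[(1,)](x, y)  # raises CompilationError on e.g. sm_86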
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:51.9672222Z 2025-05-07T20:32:51.9672641Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:51.9673248Z 2025-05-07T20:32:51.9673352Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:51.9673761Z self=, 2025-05-07T20:32:51.9674160Z T=4096, 2025-05-07T20:32:51.9674352Z D=7168, 2025-05-07T20:32:51.9674549Z scale_ub=1200.0, 2025-05-07T20:32:51.9674767Z contiguous=False, 2025-05-07T20:32:51.9674997Z compiled=False, 2025-05-07T20:32:51.9675205Z ) 2025-05-07T20:32:51.9675516Z self = 2025-05-07T20:32:51.9676052Z T = 4096, D = 7168, scale_ub = 1200.0, contiguous = False, compiled = False 2025-05-07T20:32:51.9676325Z 2025-05-07T20:32:51.9676409Z @given( 2025-05-07T20:32:51.9676638Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:51.9676946Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:51.9677246Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:51.9677575Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:51.9677901Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:51.9678185Z ) 2025-05-07T20:32:51.9678528Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:51.9678959Z def test_silu_mul_quant( 2025-05-07T20:32:51.9679265Z self, 2025-05-07T20:32:51.9679468Z T: int, 2025-05-07T20:32:51.9679672Z D: int, 2025-05-07T20:32:51.9679903Z scale_ub: Optional[float], 2025-05-07T20:32:51.9680199Z contiguous: bool, 2025-05-07T20:32:51.9680452Z compiled: bool, 2025-05-07T20:32:51.9680688Z ) -> None: 2025-05-07T20:32:51.9680910Z torch.manual_seed(2025) 2025-05-07T20:32:51.9681172Z 2025-05-07T20:32:51.9681462Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:51.9681844Z 2025-05-07T20:32:51.9682045Z x_sign = torch.sign(x) 2025-05-07T20:32:51.9682360Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:51.9682704Z x = x_sign * x_clamp 2025-05-07T20:32:51.9682965Z x0 = x[:, :D] 2025-05-07T20:32:51.9683187Z x1 = x[:, D:] 2025-05-07T20:32:51.9683410Z 2025-05-07T20:32:51.9683599Z if contiguous: 2025-05-07T20:32:51.9683851Z x0 = x0.contiguous() 2025-05-07T20:32:51.9684134Z x1 = x1.contiguous() 2025-05-07T20:32:51.9684441Z 2025-05-07T20:32:51.9684644Z if scale_ub is not None: 2025-05-07T20:32:51.9684938Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:51.9685310Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:51.9685649Z ) 2025-05-07T20:32:51.9685846Z else: 2025-05-07T20:32:51.9686065Z scale_ub_tensor = None 2025-05-07T20:32:51.9686336Z 2025-05-07T20:32:51.9686575Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:51.9686919Z op = silu_mul_quant 2025-05-07T20:32:51.9687186Z if compiled: 2025-05-07T20:32:51.9687450Z op = torch.compile(op) 2025-05-07T20:32:51.9687772Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:51.9688078Z 2025-05-07T20:32:51.9688273Z > y_fp8, y_scale = fn() 2025-05-07T20:32:51.9688453Z 2025-05-07T20:32:51.9688557Z moe/activation_test.py:117: 2025-05-07T20:32:51.9688886Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:51.9689255Z moe/activation_test.py:115: in fn 2025-05-07T20:32:51.9689561Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:51.9690375Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 
2025-05-07T20:32:51.9691199Z _fbgemm_silu_mul_quant[grid]( 2025-05-07T20:32:51.9691822Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:51.9692623Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:51.9693459Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:51.9694101Z kernel = self.compile( 2025-05-07T20:32:51.9694635Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:51.9695288Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:51.9695684Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:51.9695955Z 2025-05-07T20:32:51.9696165Z self = 2025-05-07T20:32:51.9697222Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:51.9698836Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7fe92a426480>} 2025-05-07T20:32:51.9700298Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:51.9701302Z context = 2025-05-07T20:32:51.9701582Z 2025-05-07T20:32:51.9701752Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:51.9702258Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:51.9702717Z module_map=module_map) 2025-05-07T20:32:51.9703078Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:51.9703425Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:32:51.9703681Z E ^ 2025-05-07T20:32:51.9704140Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:51.9704577Z 2025-05-07T20:32:51.9704994Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:52.1249347Z 2025-05-07T20:32:52.1249939Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:52.1250853Z self=, 2025-05-07T20:32:52.1251660Z T=16384, 2025-05-07T20:32:52.1252051Z D=7168, 2025-05-07T20:32:52.1252432Z scale_ub=None, 2025-05-07T20:32:52.1252858Z contiguous=True, 2025-05-07T20:32:52.1253303Z compiled=True, 2025-05-07T20:32:52.1253823Z ) 2025-05-07T20:32:52.1254449Z self = 2025-05-07T20:32:52.1255415Z T = 16384, D = 7168, scale_ub = None, contiguous = True, compiled = True 2025-05-07T20:32:52.1255959Z 2025-05-07T20:32:52.1256123Z @given( 2025-05-07T20:32:52.1256573Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:52.1257195Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:52.1257796Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:52.1258449Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:52.1259104Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:52.1259616Z ) 2025-05-07T20:32:52.1260007Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:52.1260455Z def test_silu_mul_quant( 2025-05-07T20:32:52.1260699Z self, 2025-05-07T20:32:52.1260897Z T: int, 2025-05-07T20:32:52.1261093Z D: int, 2025-05-07T20:32:52.1261314Z scale_ub: Optional[float], 2025-05-07T20:32:52.1261586Z contiguous: bool, 2025-05-07T20:32:52.1261825Z compiled: bool, 2025-05-07T20:32:52.1262052Z ) -> None: 2025-05-07T20:32:52.1262343Z torch.manual_seed(2025) 2025-05-07T20:32:52.1262583Z 2025-05-07T20:32:52.1262857Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:52.1263201Z 2025-05-07T20:32:52.1263393Z x_sign = torch.sign(x) 2025-05-07T20:32:52.1263686Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:52.1264001Z x = x_sign * x_clamp 2025-05-07T20:32:52.1264238Z x0 = x[:, :D] 2025-05-07T20:32:52.1264457Z x1 = x[:, D:] 2025-05-07T20:32:52.1264672Z 2025-05-07T20:32:52.1264928Z if contiguous: 2025-05-07T20:32:52.1265161Z x0 = x0.contiguous() 2025-05-07T20:32:52.1265417Z x1 = x1.contiguous() 2025-05-07T20:32:52.1265657Z 2025-05-07T20:32:52.1265845Z if scale_ub is not None: 2025-05-07T20:32:52.1266121Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:52.1266455Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:52.1266757Z ) 2025-05-07T20:32:52.1266958Z else: 2025-05-07T20:32:52.1267173Z scale_ub_tensor = None 2025-05-07T20:32:52.1267423Z 2025-05-07T20:32:52.1267651Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:52.1267966Z op = silu_mul_quant 2025-05-07T20:32:52.1268304Z if compiled: 2025-05-07T20:32:52.1268560Z op = torch.compile(op) 2025-05-07T20:32:52.1268857Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:52.1269128Z 2025-05-07T20:32:52.1269329Z > y_fp8, y_scale = fn() 2025-05-07T20:32:52.1269494Z 2025-05-07T20:32:52.1269603Z moe/activation_test.py:117: 2025-05-07T20:32:52.1269925Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:52.1270275Z moe/activation_test.py:115: in fn 2025-05-07T20:32:52.1270559Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:52.1271116Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/torch/_dynamo/eval_frame.py:678: in _fn 2025-05-07T20:32:52.1271670Z return fn(*args, **kwargs) 
2025-05-07T20:32:52.1272320Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:32:52.1273005Z _fbgemm_silu_mul_quant[grid]( 2025-05-07T20:32:52.1273588Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:52.1274255Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:52.1274912Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:52.1275441Z kernel = self.compile( 2025-05-07T20:32:52.1275971Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:52.1276621Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:52.1277018Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:52.1277244Z 2025-05-07T20:32:52.1277452Z self = 2025-05-07T20:32:52.1278514Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:52.1279887Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7fe92a427740>} 2025-05-07T20:32:52.1281233Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:52.1282234Z context = 2025-05-07T20:32:52.1282563Z 2025-05-07T20:32:52.1282732Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:52.1283241Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:52.1283708Z module_map=module_map) 2025-05-07T20:32:52.1284074Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:52.1284423Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:32:52.1284687Z E ^ 2025-05-07T20:32:52.1285190Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:52.1285628Z 2025-05-07T20:32:52.1286040Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:52.1286545Z 2025-05-07T20:32:52.1286650Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:52.1287057Z self=, 2025-05-07T20:32:52.1287458Z T=4096, 2025-05-07T20:32:52.1287645Z D=5120, 2025-05-07T20:32:52.1287845Z scale_ub=None, 2025-05-07T20:32:52.1288064Z contiguous=False, 2025-05-07T20:32:52.1288289Z compiled=True, 2025-05-07T20:32:52.1288536Z ) 2025-05-07T20:32:52.1288856Z self = 2025-05-07T20:32:52.1289343Z T = 4096, D = 5120, scale_ub = None, contiguous = False, compiled = True 2025-05-07T20:32:52.1289618Z 2025-05-07T20:32:52.1289736Z @given( 2025-05-07T20:32:52.1290006Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:52.1290328Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:52.1290634Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:52.1290956Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:52.1291282Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:52.1291581Z ) 2025-05-07T20:32:52.1291927Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:52.1292366Z def test_silu_mul_quant( 2025-05-07T20:32:52.1292601Z self, 2025-05-07T20:32:52.1292794Z T: int, 2025-05-07T20:32:52.1292991Z D: int, 2025-05-07T20:32:52.1293207Z scale_ub: Optional[float], 2025-05-07T20:32:52.1293522Z contiguous: bool, 2025-05-07T20:32:52.1293843Z compiled: bool, 2025-05-07T20:32:52.1294061Z ) -> None: 2025-05-07T20:32:52.1294280Z torch.manual_seed(2025) 2025-05-07T20:32:52.1294518Z 2025-05-07T20:32:52.1294782Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:52.1295127Z 2025-05-07T20:32:52.1295320Z x_sign = torch.sign(x) 2025-05-07T20:32:52.1295602Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:52.1295910Z x = x_sign * x_clamp 2025-05-07T20:32:52.1296148Z x0 = x[:, :D] 2025-05-07T20:32:52.1296366Z x1 = x[:, D:] 2025-05-07T20:32:52.1296566Z 2025-05-07T20:32:52.1296753Z if contiguous: 2025-05-07T20:32:52.1296981Z x0 = x0.contiguous() 2025-05-07T20:32:52.1297232Z x1 = x1.contiguous() 2025-05-07T20:32:52.1297470Z 2025-05-07T20:32:52.1297667Z if scale_ub is not None: 2025-05-07T20:32:52.1297935Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:52.1298429Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:52.1298738Z ) 2025-05-07T20:32:52.1298929Z else: 2025-05-07T20:32:52.1299138Z scale_ub_tensor = None 2025-05-07T20:32:52.1299390Z 2025-05-07T20:32:52.1299616Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:52.1299936Z op = silu_mul_quant 2025-05-07T20:32:52.1300182Z if compiled: 2025-05-07T20:32:52.1300421Z op = torch.compile(op) 2025-05-07T20:32:52.1300719Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:52.1301072Z 2025-05-07T20:32:52.1301264Z > y_fp8, y_scale = fn() 2025-05-07T20:32:52.1301425Z 2025-05-07T20:32:52.1301522Z moe/activation_test.py:117: 2025-05-07T20:32:52.1301812Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:52.1302184Z moe/activation_test.py:115: in fn 2025-05-07T20:32:52.1302460Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:52.1303013Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/torch/_dynamo/eval_frame.py:678: in _fn 2025-05-07T20:32:52.1303632Z return fn(*args, **kwargs) 
2025-05-07T20:32:52.1304276Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:32:52.1304951Z _fbgemm_silu_mul_quant[grid]( 2025-05-07T20:32:52.1305480Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:52.1306158Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:52.1306813Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:52.1307345Z kernel = self.compile( 2025-05-07T20:32:52.1307945Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:52.1308595Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:52.1308989Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:52.1309219Z 2025-05-07T20:32:52.1309422Z self = 2025-05-07T20:32:52.1310485Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:52.1311836Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7fe92aeacc20>} 2025-05-07T20:32:52.1313215Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:52.1314224Z context = 2025-05-07T20:32:52.1314513Z 2025-05-07T20:32:52.1314678Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:52.1315192Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:52.1315649Z module_map=module_map) 2025-05-07T20:32:52.1316017Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:52.1316380Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:32:52.1316642Z E ^ 2025-05-07T20:32:52.1317100Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:52.1317545Z 2025-05-07T20:32:52.1317963Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:52.2683697Z 2025-05-07T20:32:52.2683955Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:52.2684967Z self=, 2025-05-07T20:32:52.2685716Z T=4096, 2025-05-07T20:32:52.2686115Z D=5120, 2025-05-07T20:32:52.2686470Z scale_ub=1200.0, 2025-05-07T20:32:52.2686961Z contiguous=False, 2025-05-07T20:32:52.2687548Z compiled=False, 2025-05-07T20:32:52.2687962Z ) 2025-05-07T20:32:52.2688534Z self = 2025-05-07T20:32:52.2689419Z T = 4096, D = 5120, scale_ub = 1200.0, contiguous = False, compiled = False 2025-05-07T20:32:52.2690134Z 2025-05-07T20:32:52.2690275Z @given( 2025-05-07T20:32:52.2690565Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:52.2690873Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:52.2691183Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:52.2691509Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:52.2691831Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:52.2692108Z ) 2025-05-07T20:32:52.2692521Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:52.2692952Z def test_silu_mul_quant( 2025-05-07T20:32:52.2693186Z self, 2025-05-07T20:32:52.2693376Z T: int, 2025-05-07T20:32:52.2693571Z D: int, 2025-05-07T20:32:52.2693897Z scale_ub: Optional[float], 2025-05-07T20:32:52.2694164Z contiguous: bool, 2025-05-07T20:32:52.2694402Z compiled: bool, 2025-05-07T20:32:52.2694623Z ) -> None: 2025-05-07T20:32:52.2694833Z torch.manual_seed(2025) 2025-05-07T20:32:52.2695069Z 2025-05-07T20:32:52.2695334Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:52.2695669Z 2025-05-07T20:32:52.2695929Z x_sign = torch.sign(x) 2025-05-07T20:32:52.2696217Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:52.2696521Z x = x_sign * x_clamp 2025-05-07T20:32:52.2696765Z x0 = x[:, :D] 2025-05-07T20:32:52.2696981Z x1 = x[:, D:] 2025-05-07T20:32:52.2697182Z 2025-05-07T20:32:52.2697369Z if contiguous: 2025-05-07T20:32:52.2697598Z x0 = x0.contiguous() 2025-05-07T20:32:52.2697850Z x1 = x1.contiguous() 2025-05-07T20:32:52.2698091Z 2025-05-07T20:32:52.2698456Z if scale_ub is not None: 2025-05-07T20:32:52.2698722Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:52.2699053Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:52.2699365Z ) 2025-05-07T20:32:52.2699562Z else: 2025-05-07T20:32:52.2699806Z scale_ub_tensor = None 2025-05-07T20:32:52.2700065Z 2025-05-07T20:32:52.2700289Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:52.2700605Z op = silu_mul_quant 2025-05-07T20:32:52.2700923Z if compiled: 2025-05-07T20:32:52.2701168Z op = torch.compile(op) 2025-05-07T20:32:52.2701462Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:52.2701736Z 2025-05-07T20:32:52.2701931Z > y_fp8, y_scale = fn() 2025-05-07T20:32:52.2702092Z 2025-05-07T20:32:52.2702189Z moe/activation_test.py:117: 2025-05-07T20:32:52.2702480Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:52.2702814Z moe/activation_test.py:115: in fn 2025-05-07T20:32:52.2703086Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:52.2703768Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 
2025-05-07T20:32:52.2704447Z _fbgemm_silu_mul_quant[grid]( 2025-05-07T20:32:52.2704977Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:52.2705653Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:52.2706307Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:52.2706835Z kernel = self.compile( 2025-05-07T20:32:52.2707363Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:52.2708011Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:52.2708406Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:52.2708629Z 2025-05-07T20:32:52.2708909Z self = 2025-05-07T20:32:52.2709968Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:52.2711318Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7fe92aead6c0>} 2025-05-07T20:32:52.2712728Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:52.2713729Z context = 2025-05-07T20:32:52.2714007Z 2025-05-07T20:32:52.2714178Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:52.2714690Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:52.2715150Z module_map=module_map) 2025-05-07T20:32:52.2715514Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:52.2715918Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:32:52.2716179Z E ^ 2025-05-07T20:32:52.2716634Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:52.2717073Z 2025-05-07T20:32:52.2717483Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:52.2718010Z 2025-05-07T20:32:52.2718114Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:52.2718516Z self=, 2025-05-07T20:32:52.2718911Z T=4096, 2025-05-07T20:32:52.2719100Z D=5120, 2025-05-07T20:32:52.2719289Z scale_ub=1200.0, 2025-05-07T20:32:52.2719522Z contiguous=False, 2025-05-07T20:32:52.2719786Z compiled=True, 2025-05-07T20:32:52.2719989Z ) 2025-05-07T20:32:52.2720311Z self = 2025-05-07T20:32:52.2720867Z T = 4096, D = 5120, scale_ub = 1200.0, contiguous = False, compiled = True 2025-05-07T20:32:52.2721148Z 2025-05-07T20:32:52.2721237Z @given( 2025-05-07T20:32:52.2721471Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:52.2721800Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:52.2722109Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:52.2722434Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:52.2722763Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:52.2723051Z ) 2025-05-07T20:32:52.2723395Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:52.2723843Z def test_silu_mul_quant( 2025-05-07T20:32:52.2724085Z self, 2025-05-07T20:32:52.2724277Z T: int, 2025-05-07T20:32:52.2724478Z D: int, 2025-05-07T20:32:52.2724701Z scale_ub: Optional[float], 2025-05-07T20:32:52.2724969Z contiguous: bool, 2025-05-07T20:32:52.2725214Z compiled: bool, 2025-05-07T20:32:52.2725441Z ) -> None: 2025-05-07T20:32:52.2725659Z torch.manual_seed(2025) 2025-05-07T20:32:52.2725896Z 2025-05-07T20:32:52.2726169Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:52.2726514Z 2025-05-07T20:32:52.2726705Z x_sign = torch.sign(x) 2025-05-07T20:32:52.2726999Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:52.2727309Z x = x_sign * x_clamp 2025-05-07T20:32:52.2727545Z x0 = x[:, :D] 2025-05-07T20:32:52.2727767Z x1 = x[:, D:] 2025-05-07T20:32:52.2727980Z 2025-05-07T20:32:52.2728163Z if contiguous: 2025-05-07T20:32:52.2728446Z x0 = x0.contiguous() 2025-05-07T20:32:52.2728716Z x1 = x1.contiguous() 2025-05-07T20:32:52.2728951Z 2025-05-07T20:32:52.2729150Z if scale_ub is not None: 2025-05-07T20:32:52.2734892Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:52.2735331Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:52.2735646Z ) 2025-05-07T20:32:52.2735842Z else: 2025-05-07T20:32:52.2736045Z scale_ub_tensor = None 2025-05-07T20:32:52.2736384Z 2025-05-07T20:32:52.2736618Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:52.2736930Z op = silu_mul_quant 2025-05-07T20:32:52.2737181Z if compiled: 2025-05-07T20:32:52.2737426Z op = torch.compile(op) 2025-05-07T20:32:52.2737722Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:52.2737994Z 2025-05-07T20:32:52.2738183Z > y_fp8, y_scale = fn() 2025-05-07T20:32:52.2738349Z 2025-05-07T20:32:52.2738450Z moe/activation_test.py:117: 2025-05-07T20:32:52.2738745Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:52.2739073Z moe/activation_test.py:115: in fn 2025-05-07T20:32:52.2739354Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:52.2740012Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/torch/_dynamo/eval_frame.py:678: in _fn 2025-05-07T20:32:52.2740567Z return fn(*args, **kwargs) 
2025-05-07T20:32:52.2741213Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:32:52.2741893Z _fbgemm_silu_mul_quant[grid]( 2025-05-07T20:32:52.2742416Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:52.2743083Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:52.2743738Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:52.2744261Z kernel = self.compile( 2025-05-07T20:32:52.2744790Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:52.2745486Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:52.2745890Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:52.2746115Z 2025-05-07T20:32:52.2746323Z self = 2025-05-07T20:32:52.2747387Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:52.2748733Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7fe92aeaefc0>} 2025-05-07T20:32:52.2750067Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:52.2751075Z context = 2025-05-07T20:32:52.2751358Z 2025-05-07T20:32:52.2751522Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:52.2752037Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:52.2752496Z module_map=module_map) 2025-05-07T20:32:52.2752860Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:52.2753204Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:32:52.2753464Z E ^ 2025-05-07T20:32:52.2753918Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:52.2754402Z 2025-05-07T20:32:52.2754812Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:52.2755319Z 2025-05-07T20:32:52.2755430Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:52.2755842Z self=, 2025-05-07T20:32:52.2756242Z T=2048, 2025-05-07T20:32:52.2756477Z D=7168, 2025-05-07T20:32:52.2756668Z scale_ub=1200.0, 2025-05-07T20:32:52.2756892Z contiguous=False, 2025-05-07T20:32:52.2757114Z compiled=False, 2025-05-07T20:32:52.4685575Z ) 2025-05-07T20:32:52.4686312Z self = 2025-05-07T20:32:52.4687469Z T = 2048, D = 7168, scale_ub = 1200.0, contiguous = False, compiled = False 2025-05-07T20:32:52.4688248Z 2025-05-07T20:32:52.4688439Z @given( 2025-05-07T20:32:52.4689088Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:52.4689756Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:52.4690193Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:52.4690530Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:52.4690970Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:52.4691271Z ) 2025-05-07T20:32:52.4691627Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:52.4692071Z def test_silu_mul_quant( 2025-05-07T20:32:52.4692315Z self, 2025-05-07T20:32:52.4692516Z T: int, 2025-05-07T20:32:52.4692709Z D: int, 2025-05-07T20:32:52.4692934Z scale_ub: Optional[float], 2025-05-07T20:32:52.4693213Z contiguous: bool, 2025-05-07T20:32:52.4693451Z compiled: bool, 2025-05-07T20:32:52.4693758Z ) -> None: 2025-05-07T20:32:52.4693981Z torch.manual_seed(2025) 2025-05-07T20:32:52.4694231Z 2025-05-07T20:32:52.4694504Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:52.4694852Z 2025-05-07T20:32:52.4695050Z x_sign = torch.sign(x) 2025-05-07T20:32:52.4695345Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:52.4695671Z x = x_sign * x_clamp 2025-05-07T20:32:52.4695997Z x0 = x[:, :D] 2025-05-07T20:32:52.4696221Z x1 = x[:, D:] 2025-05-07T20:32:52.4696434Z 2025-05-07T20:32:52.4696627Z if contiguous: 2025-05-07T20:32:52.4696862Z x0 = x0.contiguous() 2025-05-07T20:32:52.4697123Z x1 = x1.contiguous() 2025-05-07T20:32:52.4697368Z 2025-05-07T20:32:52.4697565Z if scale_ub is not None: 2025-05-07T20:32:52.4697841Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:52.4698360Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:52.4698671Z ) 2025-05-07T20:32:52.4698872Z else: 2025-05-07T20:32:52.4699091Z scale_ub_tensor = None 2025-05-07T20:32:52.4699345Z 2025-05-07T20:32:52.4699576Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:52.4699897Z op = silu_mul_quant 2025-05-07T20:32:52.4700152Z if compiled: 2025-05-07T20:32:52.4700405Z op = torch.compile(op) 2025-05-07T20:32:52.4700713Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:52.4700992Z 2025-05-07T20:32:52.4701187Z > y_fp8, y_scale = fn() 2025-05-07T20:32:52.4701390Z 2025-05-07T20:32:52.4701496Z moe/activation_test.py:117: 2025-05-07T20:32:52.4701796Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:52.4702132Z moe/activation_test.py:115: in fn 2025-05-07T20:32:52.4702417Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:52.4703104Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 
2025-05-07T20:32:52.4703863Z _fbgemm_silu_mul_quant[grid]( 2025-05-07T20:32:52.4704407Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:52.4705090Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:52.4705755Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:52.4706291Z kernel = self.compile( 2025-05-07T20:32:52.4706841Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:52.4707567Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:52.4707976Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:52.4708207Z 2025-05-07T20:32:52.4708415Z self = 2025-05-07T20:32:52.4709486Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:52.4710954Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7fe92aeafec0>} 2025-05-07T20:32:52.4712286Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:52.4713300Z context = 2025-05-07T20:32:52.4713587Z 2025-05-07T20:32:52.4713760Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:52.4714280Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:52.4714749Z module_map=module_map) 2025-05-07T20:32:52.4715124Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:52.4715483Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:32:52.4715742Z E ^ 2025-05-07T20:32:52.4716265Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:52.4716716Z 2025-05-07T20:32:52.4717130Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:52.4717640Z 2025-05-07T20:32:52.4717750Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:52.4718167Z self=, 2025-05-07T20:32:52.4718570Z T=1, 2025-05-07T20:32:52.4718761Z D=7168, 2025-05-07T20:32:52.4718954Z scale_ub=None, 2025-05-07T20:32:52.4719174Z contiguous=True, 2025-05-07T20:32:52.4719403Z compiled=False, 2025-05-07T20:32:52.4719608Z ) 2025-05-07T20:32:52.4719972Z self = 2025-05-07T20:32:52.4720464Z T = 1, D = 7168, scale_ub = None, contiguous = True, compiled = False 2025-05-07T20:32:52.4720727Z 2025-05-07T20:32:52.4720819Z @given( 2025-05-07T20:32:52.4721052Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:52.4721369Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:52.4721680Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:52.4722010Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:52.4722342Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:52.4722630Z ) 2025-05-07T20:32:52.4722978Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:52.4723421Z def test_silu_mul_quant( 2025-05-07T20:32:52.4723662Z self, 2025-05-07T20:32:52.4723855Z T: int, 2025-05-07T20:32:52.4724135Z D: int, 2025-05-07T20:32:52.4724356Z scale_ub: Optional[float], 2025-05-07T20:32:52.4724626Z contiguous: bool, 2025-05-07T20:32:52.4724863Z compiled: bool, 2025-05-07T20:32:52.4725089Z ) -> None: 2025-05-07T20:32:52.4725309Z torch.manual_seed(2025) 2025-05-07T20:32:52.4725548Z 2025-05-07T20:32:52.4725828Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:52.4726173Z 2025-05-07T20:32:52.4726368Z x_sign = torch.sign(x) 2025-05-07T20:32:52.4726708Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:52.4727018Z x = x_sign * x_clamp 2025-05-07T20:32:52.4727260Z x0 = x[:, :D] 2025-05-07T20:32:52.4727484Z x1 = x[:, D:] 2025-05-07T20:32:52.4727693Z 2025-05-07T20:32:52.4727879Z if contiguous: 2025-05-07T20:32:52.4728116Z x0 = x0.contiguous() 2025-05-07T20:32:52.4728378Z x1 = x1.contiguous() 2025-05-07T20:32:52.4728621Z 2025-05-07T20:32:52.4728817Z if scale_ub is not None: 2025-05-07T20:32:52.4729095Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:52.4729428Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:52.4729742Z ) 2025-05-07T20:32:52.4729981Z else: 2025-05-07T20:32:52.4730238Z scale_ub_tensor = None 2025-05-07T20:32:52.4730507Z 2025-05-07T20:32:52.4730742Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:52.4731062Z op = silu_mul_quant 2025-05-07T20:32:52.4731314Z if compiled: 2025-05-07T20:32:52.4731562Z op = torch.compile(op) 2025-05-07T20:32:52.4731870Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:52.4732142Z 2025-05-07T20:32:52.4732338Z > y_fp8, y_scale = fn() 2025-05-07T20:32:52.4732503Z 2025-05-07T20:32:52.4732605Z moe/activation_test.py:117: 2025-05-07T20:32:52.4732901Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:52.4733238Z moe/activation_test.py:115: in fn 2025-05-07T20:32:52.4733523Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:52.4734298Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:32:52.4735024Z 
_fbgemm_silu_mul_quant[grid]( 2025-05-07T20:32:52.4735562Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:52.4736242Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:52.4736905Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:52.4737436Z kernel = self.compile( 2025-05-07T20:32:52.4737980Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:52.4738638Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:52.4739035Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:52.4739270Z 2025-05-07T20:32:52.4739482Z self = 2025-05-07T20:32:52.4740607Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:52.4741964Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7fe92b26ccc0>} 2025-05-07T20:32:52.4743290Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:52.4744349Z context = 2025-05-07T20:32:52.4744639Z 2025-05-07T20:32:52.4744806Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:52.4745326Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:52.4745798Z module_map=module_map) 2025-05-07T20:32:52.4746161Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:52.4746559Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:32:52.4746825Z E ^ 2025-05-07T20:32:52.4747283Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:52.4747732Z 2025-05-07T20:32:52.4748144Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:52.4748657Z 2025-05-07T20:32:52.4748767Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:52.4749180Z self=, 2025-05-07T20:32:52.4749583Z T=16384, 2025-05-07T20:32:52.4749778Z D=7168, 2025-05-07T20:32:52.4749973Z scale_ub=1200.0, 2025-05-07T20:32:52.4750243Z contiguous=False, 2025-05-07T20:32:52.4750473Z compiled=True, 2025-05-07T20:32:52.4750680Z ) 2025-05-07T20:32:52.4750997Z self = 2025-05-07T20:32:52.4751493Z T = 16384, D = 7168, scale_ub = 1200.0, contiguous = False, compiled = True 2025-05-07T20:32:52.4751771Z 2025-05-07T20:32:52.4751851Z @given( 2025-05-07T20:32:52.4752078Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:52.4752389Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:52.4752691Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:52.4753018Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:52.4753343Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:52.4753629Z ) 2025-05-07T20:32:52.4753975Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:52.4754414Z def test_silu_mul_quant( 2025-05-07T20:32:52.4754657Z self, 2025-05-07T20:32:52.4754857Z T: int, 2025-05-07T20:32:52.4755051Z D: int, 2025-05-07T20:32:52.4755316Z scale_ub: Optional[float], 2025-05-07T20:32:52.4755586Z contiguous: bool, 2025-05-07T20:32:52.4755817Z compiled: bool, 2025-05-07T20:32:52.4756043Z ) -> None: 2025-05-07T20:32:52.4756260Z torch.manual_seed(2025) 2025-05-07T20:32:52.4756495Z 2025-05-07T20:32:52.4756769Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:52.4757109Z 2025-05-07T20:32:52.4757303Z x_sign = torch.sign(x) 2025-05-07T20:32:52.4757586Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:52.4757892Z x = x_sign * x_clamp 2025-05-07T20:32:52.4758137Z x0 = x[:, :D] 2025-05-07T20:32:52.4758349Z x1 = x[:, D:] 2025-05-07T20:32:52.4758555Z 2025-05-07T20:32:52.4758741Z if contiguous: 2025-05-07T20:32:52.4758964Z x0 = x0.contiguous() 2025-05-07T20:32:52.4759225Z x1 = x1.contiguous() 2025-05-07T20:32:52.4759469Z 2025-05-07T20:32:52.4759660Z if scale_ub is not None: 2025-05-07T20:32:52.4759937Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:52.4760268Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:52.4760581Z ) 2025-05-07T20:32:52.4760780Z else: 2025-05-07T20:32:52.4760989Z scale_ub_tensor = None 2025-05-07T20:32:52.4761234Z 2025-05-07T20:32:52.4761466Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:52.4761778Z op = silu_mul_quant 2025-05-07T20:32:52.4762027Z if compiled: 2025-05-07T20:32:52.4762268Z op = torch.compile(op) 2025-05-07T20:32:52.4762621Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:52.4762897Z 2025-05-07T20:32:52.4763086Z > y_fp8, y_scale = fn() 2025-05-07T20:32:52.4763249Z 2025-05-07T20:32:52.4763348Z moe/activation_test.py:117: 2025-05-07T20:32:52.4763644Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:52.4763973Z moe/activation_test.py:115: in fn 2025-05-07T20:32:52.4764247Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:52.4764803Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/torch/_dynamo/eval_frame.py:678: in _fn 2025-05-07T20:32:52.4765396Z return fn(*args, **kwargs) 
2025-05-07T20:32:52.4766049Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:32:52.4766723Z _fbgemm_silu_mul_quant[grid]( 2025-05-07T20:32:52.4767256Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:52.4767930Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:52.4768627Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:52.4769155Z kernel = self.compile( 2025-05-07T20:32:52.4769696Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:52.4770394Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:52.4770791Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:52.4771016Z 2025-05-07T20:32:52.4771225Z self = 2025-05-07T20:32:52.4772277Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:52.4773628Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7fe92b26e0c0>} 2025-05-07T20:32:52.4775094Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:52.4776103Z context = 2025-05-07T20:32:52.4776385Z 2025-05-07T20:32:52.4776556Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:52.4777067Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:52.4777530Z module_map=module_map) 2025-05-07T20:32:52.4777898Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:52.4778251Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:32:52.4778508Z E ^ 2025-05-07T20:32:52.4778962Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:52.4779408Z 2025-05-07T20:32:52.4779824Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:52.6097861Z 2025-05-07T20:32:52.6098762Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:52.6099587Z self=, 2025-05-07T20:32:52.6100029Z T=1, 2025-05-07T20:32:52.6100210Z D=7168, 2025-05-07T20:32:52.6100399Z scale_ub=None, 2025-05-07T20:32:52.6100606Z contiguous=False, 2025-05-07T20:32:52.6100828Z compiled=False, 2025-05-07T20:32:52.6101027Z ) 2025-05-07T20:32:52.6101332Z self = 2025-05-07T20:32:52.6101974Z T = 1, D = 7168, scale_ub = None, contiguous = False, compiled = False 2025-05-07T20:32:52.6102234Z 2025-05-07T20:32:52.6102312Z @given( 2025-05-07T20:32:52.6102538Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:52.6102849Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:52.6103152Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:52.6103473Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:52.6103788Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:52.6104142Z ) 2025-05-07T20:32:52.6104480Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:52.6104904Z def test_silu_mul_quant( 2025-05-07T20:32:52.6105138Z self, 2025-05-07T20:32:52.6105327Z T: int, 2025-05-07T20:32:52.6105514Z D: int, 2025-05-07T20:32:52.6105724Z scale_ub: Optional[float], 2025-05-07T20:32:52.6105992Z contiguous: bool, 2025-05-07T20:32:52.6106225Z compiled: bool, 2025-05-07T20:32:52.6106438Z ) -> None: 2025-05-07T20:32:52.6106646Z torch.manual_seed(2025) 2025-05-07T20:32:52.6106883Z 2025-05-07T20:32:52.6107213Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:52.6107553Z 2025-05-07T20:32:52.6107744Z x_sign = torch.sign(x) 2025-05-07T20:32:52.6108023Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:52.6108324Z x = x_sign * x_clamp 2025-05-07T20:32:52.6108565Z x0 = x[:, :D] 2025-05-07T20:32:52.6108770Z x1 = x[:, D:] 2025-05-07T20:32:52.6108970Z 2025-05-07T20:32:52.6109148Z if contiguous: 2025-05-07T20:32:52.6109365Z x0 = x0.contiguous() 2025-05-07T20:32:52.6109615Z x1 = x1.contiguous() 2025-05-07T20:32:52.6109847Z 2025-05-07T20:32:52.6110028Z if scale_ub is not None: 2025-05-07T20:32:52.6110296Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:52.6110624Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:52.6110925Z ) 2025-05-07T20:32:52.6111110Z else: 2025-05-07T20:32:52.6111313Z scale_ub_tensor = None 2025-05-07T20:32:52.6111562Z 2025-05-07T20:32:52.6111784Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:52.6112159Z op = silu_mul_quant 2025-05-07T20:32:52.6112406Z if compiled: 2025-05-07T20:32:52.6112649Z op = torch.compile(op) 2025-05-07T20:32:52.6112950Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:52.6113224Z 2025-05-07T20:32:52.6113412Z > y_fp8, y_scale = fn() 2025-05-07T20:32:52.6113578Z 2025-05-07T20:32:52.6113677Z moe/activation_test.py:117: 2025-05-07T20:32:52.6113965Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:52.6114318Z moe/activation_test.py:115: in fn 2025-05-07T20:32:52.6114598Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:52.6115281Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:32:52.6115961Z 
_fbgemm_silu_mul_quant[grid]( 2025-05-07T20:32:52.6116496Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:52.6117161Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:52.6117813Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:52.6118333Z kernel = self.compile( 2025-05-07T20:32:52.6118863Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:52.6119501Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:52.6119890Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:52.6120163Z 2025-05-07T20:32:52.6120371Z self = 2025-05-07T20:32:52.6121435Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:52.6122788Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7fe92b26ec00>} 2025-05-07T20:32:52.6124148Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:52.6125146Z context = 2025-05-07T20:32:52.6125428Z 2025-05-07T20:32:52.6125596Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:52.6126099Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:52.6126559Z module_map=module_map) 2025-05-07T20:32:52.6126969Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:52.6127332Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:32:52.6127590Z E ^ 2025-05-07T20:32:52.6134045Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:52.6134538Z 2025-05-07T20:32:52.6134965Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:52.6135473Z 2025-05-07T20:32:52.6135576Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:52.6135986Z self=, 2025-05-07T20:32:52.6136386Z T=2048, 2025-05-07T20:32:52.6136576Z D=7168, 2025-05-07T20:32:52.6136759Z scale_ub=None, 2025-05-07T20:32:52.6136971Z contiguous=False, 2025-05-07T20:32:52.6137192Z compiled=True, 2025-05-07T20:32:52.6137386Z ) 2025-05-07T20:32:52.6137701Z self = 2025-05-07T20:32:52.6138265Z T = 2048, D = 7168, scale_ub = None, contiguous = False, compiled = True 2025-05-07T20:32:52.6138532Z 2025-05-07T20:32:52.6138608Z @given( 2025-05-07T20:32:52.6138844Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:52.6139154Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:52.6139448Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:52.6139772Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:52.6140093Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:52.6140375Z ) 2025-05-07T20:32:52.6140714Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:52.6141158Z def test_silu_mul_quant( 2025-05-07T20:32:52.6141397Z self, 2025-05-07T20:32:52.6141585Z T: int, 2025-05-07T20:32:52.6141779Z D: int, 2025-05-07T20:32:52.6142001Z scale_ub: Optional[float], 2025-05-07T20:32:52.6142266Z contiguous: bool, 2025-05-07T20:32:52.6142503Z compiled: bool, 2025-05-07T20:32:52.6142723Z ) -> None: 2025-05-07T20:32:52.6142932Z torch.manual_seed(2025) 2025-05-07T20:32:52.6143175Z 2025-05-07T20:32:52.6143446Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:52.6143773Z 2025-05-07T20:32:52.6143961Z x_sign = torch.sign(x) 2025-05-07T20:32:52.6144249Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:52.6144551Z x = x_sign * x_clamp 2025-05-07T20:32:52.6144788Z x0 = x[:, :D] 2025-05-07T20:32:52.6145002Z x1 = x[:, D:] 2025-05-07T20:32:52.6145264Z 2025-05-07T20:32:52.6145443Z if contiguous: 2025-05-07T20:32:52.6145674Z x0 = x0.contiguous() 2025-05-07T20:32:52.6145926Z x1 = x1.contiguous() 2025-05-07T20:32:52.6146160Z 2025-05-07T20:32:52.6146349Z if scale_ub is not None: 2025-05-07T20:32:52.6146619Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:52.6146947Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:52.6147252Z ) 2025-05-07T20:32:52.6147435Z else: 2025-05-07T20:32:52.6147696Z scale_ub_tensor = None 2025-05-07T20:32:52.6147944Z 2025-05-07T20:32:52.6148166Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:52.6148476Z op = silu_mul_quant 2025-05-07T20:32:52.6148722Z if compiled: 2025-05-07T20:32:52.6148967Z op = torch.compile(op) 2025-05-07T20:32:52.6149254Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:52.6149526Z 2025-05-07T20:32:52.6149716Z > y_fp8, y_scale = fn() 2025-05-07T20:32:52.6149877Z 2025-05-07T20:32:52.6149974Z moe/activation_test.py:117: 2025-05-07T20:32:52.6150269Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:52.6150600Z moe/activation_test.py:115: in fn 2025-05-07T20:32:52.6150926Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:52.6151489Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/torch/_dynamo/eval_frame.py:678: in _fn 2025-05-07T20:32:52.6152046Z return fn(*args, **kwargs) 
2025-05-07T20:32:52.6152694Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:32:52.6153368Z _fbgemm_silu_mul_quant[grid]( 2025-05-07T20:32:52.6153892Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:52.6154559Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:52.6155209Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:52.6155733Z kernel = self.compile( 2025-05-07T20:32:52.6156268Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:52.6156960Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:52.6157355Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:52.6157585Z 2025-05-07T20:32:52.6157787Z self = 2025-05-07T20:32:52.6158847Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:52.6160200Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7fe92a6002c0>} 2025-05-07T20:32:52.6161524Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:52.6162523Z context = 2025-05-07T20:32:52.6162809Z 2025-05-07T20:32:52.6162973Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:52.6163483Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:52.6163936Z module_map=module_map) 2025-05-07T20:32:52.6164298Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:52.6164647Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:32:52.6164951Z E ^ 2025-05-07T20:32:52.6165401Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:52.6165845Z 2025-05-07T20:32:52.6166257Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:52.6166762Z 2025-05-07T20:32:52.6166870Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:52.6167274Z self=, 2025-05-07T20:32:52.6167740Z T=4096, 2025-05-07T20:32:52.6167926Z D=7168, 2025-05-07T20:32:52.6168115Z scale_ub=None, 2025-05-07T20:32:52.6168325Z contiguous=False, 2025-05-07T20:32:52.6168549Z compiled=True, 2025-05-07T20:32:53.0265289Z ) 2025-05-07T20:32:53.0265626Z self = 2025-05-07T20:32:53.0266156Z T = 4096, D = 7168, scale_ub = None, contiguous = False, compiled = True 2025-05-07T20:32:53.0266451Z 2025-05-07T20:32:53.0266546Z @given( 2025-05-07T20:32:53.0266797Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:53.0267112Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:53.0267421Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:53.0267866Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:53.0268199Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:53.0268491Z ) 2025-05-07T20:32:53.0268841Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:53.0269293Z def test_silu_mul_quant( 2025-05-07T20:32:53.0269534Z self, 2025-05-07T20:32:53.0269733Z T: int, 2025-05-07T20:32:53.0269963Z D: int, 2025-05-07T20:32:53.0270204Z scale_ub: Optional[float], 2025-05-07T20:32:53.0270473Z contiguous: bool, 2025-05-07T20:32:53.0270712Z compiled: bool, 2025-05-07T20:32:53.0270936Z ) -> None: 2025-05-07T20:32:53.0271157Z torch.manual_seed(2025) 2025-05-07T20:32:53.0271400Z 2025-05-07T20:32:53.0271671Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:53.0272017Z 2025-05-07T20:32:53.0272214Z x_sign = torch.sign(x) 2025-05-07T20:32:53.0272505Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:53.0272887Z x = x_sign * x_clamp 2025-05-07T20:32:53.0273132Z x0 = x[:, :D] 2025-05-07T20:32:53.0273347Z x1 = x[:, D:] 2025-05-07T20:32:53.0273557Z 2025-05-07T20:32:53.0273743Z if contiguous: 2025-05-07T20:32:53.0273968Z x0 = x0.contiguous() 2025-05-07T20:32:53.0274229Z x1 = x1.contiguous() 2025-05-07T20:32:53.0274473Z 2025-05-07T20:32:53.0274664Z if scale_ub is not None: 2025-05-07T20:32:53.0274981Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:53.0275312Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:53.0275625Z ) 2025-05-07T20:32:53.0275821Z else: 2025-05-07T20:32:53.0276027Z scale_ub_tensor = None 2025-05-07T20:32:53.0276280Z 2025-05-07T20:32:53.0276514Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:53.0276830Z op = silu_mul_quant 2025-05-07T20:32:53.0277078Z if compiled: 2025-05-07T20:32:53.0277336Z op = torch.compile(op) 2025-05-07T20:32:53.0277634Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:53.0277908Z 2025-05-07T20:32:53.0278105Z > y_fp8, y_scale = fn() 2025-05-07T20:32:53.0278269Z 2025-05-07T20:32:53.0278373Z moe/activation_test.py:117: 2025-05-07T20:32:53.0278665Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:53.0278996Z moe/activation_test.py:115: in fn 2025-05-07T20:32:53.0279280Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:53.0279834Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/torch/_dynamo/eval_frame.py:678: in _fn 2025-05-07T20:32:53.0280476Z return fn(*args, **kwargs) 
2025-05-07T20:32:53.0281127Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:32:53.0281810Z _fbgemm_silu_mul_quant[grid]( 2025-05-07T20:32:53.0282345Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:53.0283021Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:53.0283748Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:53.0284284Z kernel = self.compile( 2025-05-07T20:32:53.0284821Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:53.0285474Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:53.0285878Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:53.0286106Z 2025-05-07T20:32:53.0286317Z self = 2025-05-07T20:32:53.0287429Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:53.0288787Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7fe92a600d60>} 2025-05-07T20:32:53.0290121Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:53.0291134Z context = 2025-05-07T20:32:53.0291425Z 2025-05-07T20:32:53.0291591Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:53.0292108Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:53.0292580Z module_map=module_map) 2025-05-07T20:32:53.0292992Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:53.0293344Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:32:53.0293610Z E ^ 2025-05-07T20:32:53.0294171Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:53.0294616Z 2025-05-07T20:32:53.0295026Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:100: CompilationError
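[Root cause: Triton refuses to lower the fp8e4nv (torch.float8_e4m3fn) dtype when targeting this runner's GPU. The g5.4xlarge instance carries an NVIDIA A10G, compute capability (8, 6), while Triton only enables fp8e4nv on sm_89 and newer; on older parts it offers just fp8e4b15 and fp8e5, exactly as the ValueError lists. A minimal sketch that reproduces the error outside the test suite follows — it assumes a CUDA build of Triton on a pre-sm_89 device, and the kernel and tensor names are illustrative, not taken from FBGEMM:

    import torch
    import triton
    import triton.language as tl

    @triton.jit
    def _cast_to_fp8e4nv(x_ptr, y_ptr, n, BLOCK: tl.constexpr):
        offs = tl.program_id(0) * BLOCK + tl.arange(0, BLOCK)
        mask = offs < n
        x = tl.load(x_ptr + offs, mask=mask)
        # The offending conversion: fp8e4nv is only legal on sm_89+, so
        # ast_to_ttir raises CompilationError while building the kernel IR.
        y = x.to(tl.float8e4nv)
        tl.store(y_ptr + offs, y, mask=mask)

    x = torch.randn(1024, device="cuda", dtype=torch.bfloat16)
    y = torch.empty(1024, device="cuda", dtype=torch.float8_e4m3fn)
    _cast_to_fp8e4nv[(1,)](x, y, x.numel(), BLOCK=1024)  # fails at compile time on an A10G (sm_86)
]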
[log condensed: Hypothesis went on to retry the same test body ten more times, and every example failed in _fbgemm_silu_mul_quant with the identical CompilationError — ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')"). The reprinted test source and tracebacks are elided; the parameter combinations tried, in order, were:

  T=16384, D=5120, scale_ub=1200.0, contiguous=False, compiled=False
  T=16384, D=5120, scale_ub=1200.0, contiguous=True,  compiled=True
  T=16384, D=5120, scale_ub=None,   contiguous=False, compiled=True
  T=2048,  D=5120, scale_ub=None,   contiguous=False, compiled=True
  T=2048,  D=5120, scale_ub=1200.0, contiguous=False, compiled=True
  T=4096,  D=5120, scale_ub=1200.0, contiguous=True,  compiled=True
  T=128,   D=5120, scale_ub=1200.0, contiguous=False, compiled=True
  T=16384, D=7168, scale_ub=1200.0, contiguous=True,  compiled=True
  T=16384, D=5120, scale_ub=1200.0, contiguous=True,  compiled=False
  T=1,     D=7168, scale_ub=1200.0, contiguous=False, compiled=False]
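[Since every sampled combination of T, D, scale_ub, contiguous, and compiled fails identically, the failure is architecture-bound rather than shape- or mode-dependent. One conventional guard — a sketch only, not how FBGEMM necessarily handles it, and supports_fp8e4nv is a hypothetical helper rather than an FBGEMM API — is to skip FP8 tests on devices below sm_89:

    import unittest
    import torch

    def supports_fp8e4nv() -> bool:
        # Triton's fp8e4nv (torch.float8_e4m3fn) lowering needs compute
        # capability 8.9 or newer (Ada/Hopper); the A10G reports (8, 6).
        return torch.cuda.is_available() and torch.cuda.get_device_capability() >= (8, 9)

    @unittest.skipUnless(supports_fp8e4nv(), "fp8e4nv requires sm_89 or newer")
    class SiluMulQuantTest(unittest.TestCase):
        ...
]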
[final Hypothesis example; test body identical to the listing above:]
Trying example: test_silu_mul_quant( T=4096, D=7168, scale_ub=1200.0, contiguous=False, compiled=True, )
2025-05-07T20:32:53.8211559Z > y_fp8, y_scale = fn() 2025-05-07T20:32:53.8211721Z 2025-05-07T20:32:53.8211826Z moe/activation_test.py:117: 2025-05-07T20:32:53.8212126Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:53.8212455Z moe/activation_test.py:115: in fn 2025-05-07T20:32:53.8212737Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:53.8213301Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/torch/_dynamo/eval_frame.py:678: in _fn 2025-05-07T20:32:53.8213941Z return fn(*args, **kwargs)
2025-05-07T20:32:53.8214595Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:32:53.8215275Z _fbgemm_silu_mul_quant[grid]( 2025-05-07T20:32:53.8215808Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:53.8216478Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:53.8217135Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:53.8217791Z kernel = self.compile( 2025-05-07T20:32:53.8218324Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:53.8218973Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:53.8219383Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:53.8219610Z 2025-05-07T20:32:53.8219825Z self = 2025-05-07T20:32:53.8220961Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:53.8222337Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7fe92a04d300>} 2025-05-07T20:32:53.8223674Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:53.8224723Z context = 2025-05-07T20:32:53.8225008Z 2025-05-07T20:32:53.8225182Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:53.8225690Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:53.8226161Z module_map=module_map) 2025-05-07T20:32:53.8226531Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:53.8226877Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:32:53.8227137Z E ^ 2025-05-07T20:32:53.8227600Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')")

/home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:100: CompilationError

Trying example: test_silu_mul_quant(T=128, D=7168, scale_ub=1200.0, contiguous=False, compiled=True)
Trying example: test_silu_mul_quant(T=2048, D=7168, scale_ub=None, contiguous=True, compiled=True)

[identical test source and traceback omitted for each example; both fail at the same point]

    triton.compiler.errors.CompilationError: at 1:0:
    def _fbgemm_silu_mul_quant(
    ^
    ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')")

/home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:100: CompilationError
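The recurring CompilationError is raised while Triton lowers _fbgemm_silu_mul_quant: the kernel requests the fp8e4nv element type (float8_e4m3fn), which this GPU's backend cannot emit; only fp8e4b15 and fp8e5 are available here. A minimal sketch of a guard a test could use to skip such cases; the helper name is hypothetical, and the compute-capability 8.9 threshold is an assumption based on fp8e4nv being a native type only on newer parts:

    import unittest

    import torch

    def supports_fp8e4nv() -> bool:
        # Assumption: fp8e4nv (float8_e4m3fn) kernels only compile on
        # devices with compute capability >= 8.9.
        if not torch.cuda.is_available():
            return False
        return torch.cuda.get_device_capability() >= (8, 9)

    # Possible use at the top of test_silu_mul_quant:
    #     if not supports_fp8e4nv():
    #         raise unittest.SkipTest("fp8e4nv not supported on this GPU")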
The next five examples fail earlier, with torch.OutOfMemoryError while building the bf16 input [identical test source omitted; unique details kept]:

Trying example: test_silu_mul_quant(T=16384, D=5120, scale_ub=None, contiguous=False, compiled=False)
  moe/activation_test.py:95 (x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0)): tried to allocate 320.00 MiB; GPU 0 has 140.44 MiB free of 22.07 GiB; 21.60 GiB allocated by PyTorch, 45.02 MiB reserved but unallocated.
Trying example: test_silu_mul_quant(T=4096, D=7168, scale_ub=1200.0, contiguous=True, compiled=True)
  moe/activation_test.py:95 (x_clamp): tried to allocate 112.00 MiB; 28.44 MiB free; 21.61 GiB allocated, 141.02 MiB reserved but unallocated.
Trying example: test_silu_mul_quant(T=16384, D=7168, scale_ub=None, contiguous=False, compiled=False)
  moe/activation_test.py:92 (x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16)): tried to allocate 448.00 MiB; 140.44 MiB free; 21.50 GiB allocated, 141.02 MiB reserved but unallocated.
Trying example: test_silu_mul_quant(T=2048, D=7168, scale_ub=1200.0, contiguous=True, compiled=True)
  moe/activation_test.py:95 (x_clamp): tried to allocate 56.00 MiB; 28.44 MiB free; 21.67 GiB allocated, 85.02 MiB reserved but unallocated.
Trying example: test_silu_mul_quant(T=2048, D=7168, scale_ub=None, contiguous=True, compiled=False)
  moe/activation_test.py:94 (x_sign = torch.sign(x)): tried to allocate 56.00 MiB; 28.44 MiB free; 21.67 GiB allocated, 85.02 MiB reserved but unallocated.

Every message carries the same hint: "If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables)".
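The OutOfMemoryError cases are a secondary failure mode: each Hypothesis example allocates fresh bf16 tensors of up to T x 2D = 16384 x 14336 elements (the 448.00 MiB allocation above), and with earlier allocations still cached the 22.07 GiB device runs dry. The error text itself points at PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True; a sketch of that plus explicitly releasing cached blocks between examples (neither is applied in this run, and the environment variable must be set before the process first touches CUDA):

    import os

    # Assumption: this line runs before the first CUDA allocation.
    os.environ.setdefault("PYTORCH_CUDA_ALLOC_CONF", "expandable_segments:True")

    import torch

    def release_cached_cuda_memory() -> None:
        # Hand cached-but-unallocated blocks back to the driver, e.g. in a
        # per-example teardown, so fragmentation does not accumulate.
        torch.cuda.synchronize()
        torch.cuda.empty_cache()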
Trying example: test_silu_mul_quant(T=1, D=7168, scale_ub=1200.0, contiguous=True, compiled=False)
Trying example: test_silu_mul_quant(T=128, D=5120, scale_ub=None, contiguous=True, compiled=False)
Trying example: test_silu_mul_quant(T=128, D=7168, scale_ub=None, contiguous=True, compiled=False)

[identical test source omitted] All three reach the kernel launch and fail with the identical Triton traceback as above (jit.py:330 -> jit.py:623 -> compiler.py:273 -> make_ir), ending in:

    triton.compiler.errors.CompilationError: at 1:0:
    def _fbgemm_silu_mul_quant(
    ^
    ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')")

/home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:100: CompilationError
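For orientation, the contract the test exercises can be read off its body: silu_mul_quant takes two D-wide halves x0 and x1 plus an optional scale_ub tensor and returns a (y_fp8, y_scale) pair. A pure-PyTorch sketch of the assumed semantics, SiLU(x0) * x1 followed by row-wise fp8e4m3 quantization; the actual Triton kernel's scaling scheme is not visible in this log, so every detail below is illustrative:

    from typing import Optional, Tuple

    import torch

    def silu_mul_quant_ref(
        x0: torch.Tensor,
        x1: torch.Tensor,
        scale_ub: Optional[torch.Tensor] = None,
    ) -> Tuple[torch.Tensor, torch.Tensor]:
        # Assumed semantics: fused SiLU(x0) * x1, then row-wise quantization
        # to float8_e4m3fn ("fp8e4nv" in Triton terms), returning values and
        # per-row scales.
        y = torch.nn.functional.silu(x0.float()) * x1.float()
        row_max = y.abs().amax(dim=-1, keepdim=True).clamp(min=1e-12)
        if scale_ub is not None:
            # Cap the per-row maximum, mirroring the optional scale_ub input.
            row_max = torch.minimum(row_max, scale_ub.float())
        fp8_max = torch.finfo(torch.float8_e4m3fn).max
        y_scale = row_max / fp8_max
        y_fp8 = (y / y_scale).clamp(-fp8_max, fp8_max).to(torch.float8_e4m3fn)
        return y_fp8, y_scale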
The remaining examples alternate between the two failure modes [identical test source omitted; unique details kept]:

Trying example: test_silu_mul_quant(T=2048, D=7168, scale_ub=1200.0, contiguous=True, compiled=False)
  moe/activation_test.py:92 (x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16)): torch.OutOfMemoryError, tried to allocate 56.00 MiB; 26.44 MiB free of 22.07 GiB; 21.69 GiB allocated by PyTorch, 59.18 MiB reserved but unallocated.
Trying example: test_silu_mul_quant(T=1, D=5120, scale_ub=1200.0, contiguous=True, compiled=False)
  CompilationError with the identical Triton traceback as above: ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')")
Trying example: test_silu_mul_quant(T=2048, D=5120, scale_ub=None, contiguous=True, compiled=False)
  moe/activation_test.py:94 (x_sign = torch.sign(x)): torch.OutOfMemoryError, tried to allocate 40.00 MiB; 26.44 MiB free; 21.73 GiB allocated, 19.12 MiB reserved but unallocated.
Trying example: test_silu_mul_quant(T=16384, D=5120, scale_ub=None, contiguous=True, compiled=False)
  moe/activation_test.py:92 (the torch.randn allocation): torch.OutOfMemoryError, tried to allocate 320.00 MiB; 26.44 MiB free; 21.73 GiB allocated, 19.12 MiB reserved but unallocated.

Trying example: test_silu_mul_quant(T=4096, D=5120, scale_ub=None, contiguous=True, compiled=False)

>       x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16)
E       torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 80.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 26.44 MiB is free. Including non-PyTorch memory, this process has 22.04 GiB memory in use. Of the allocated memory 21.73 GiB is allocated by PyTorch, and 19.12 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation.
See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables) 2025-05-07T20:32:54.3059207Z 2025-05-07T20:32:54.3059329Z moe/activation_test.py:92: OutOfMemoryError 2025-05-07T20:32:54.3059548Z 2025-05-07T20:32:54.3059664Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:54.3060081Z self=, 2025-05-07T20:32:54.3060489Z T=2048, 2025-05-07T20:32:54.3060689Z D=5120, 2025-05-07T20:32:54.3060884Z scale_ub=None, 2025-05-07T20:32:54.3061107Z contiguous=False, 2025-05-07T20:32:54.3061394Z compiled=False, 2025-05-07T20:32:54.3061606Z ) 2025-05-07T20:32:54.3061925Z self = 2025-05-07T20:32:54.3062415Z T = 2048, D = 5120, scale_ub = None, contiguous = False, compiled = False 2025-05-07T20:32:54.3062696Z 2025-05-07T20:32:54.3062777Z @given( 2025-05-07T20:32:54.3063016Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:54.3063336Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:54.3063643Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:54.3063977Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:54.3064315Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:54.3064599Z ) 2025-05-07T20:32:54.3065029Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:54.3065477Z def test_silu_mul_quant( 2025-05-07T20:32:54.3065716Z self, 2025-05-07T20:32:54.3065917Z T: int, 2025-05-07T20:32:54.3066125Z D: int, 2025-05-07T20:32:54.3066346Z scale_ub: Optional[float], 2025-05-07T20:32:54.3066620Z contiguous: bool, 2025-05-07T20:32:54.3066866Z compiled: bool, 2025-05-07T20:32:54.3067087Z ) -> None: 2025-05-07T20:32:54.3067307Z torch.manual_seed(2025) 2025-05-07T20:32:54.3067554Z 2025-05-07T20:32:54.3067827Z > x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:54.3070498Z E torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 40.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 26.44 MiB is free. Including non-PyTorch memory, this process has 22.04 GiB memory in use. Of the allocated memory 21.73 GiB is allocated by PyTorch, and 19.12 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. 
See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables) 2025-05-07T20:32:54.3072318Z 2025-05-07T20:32:54.3072439Z moe/activation_test.py:92: OutOfMemoryError 2025-05-07T20:32:54.3072657Z 2025-05-07T20:32:54.3072761Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:54.3073174Z self=, 2025-05-07T20:32:54.3073570Z T=4096, 2025-05-07T20:32:54.3073762Z D=7168, 2025-05-07T20:32:54.3073957Z scale_ub=None, 2025-05-07T20:32:54.3074195Z contiguous=True, 2025-05-07T20:32:54.3074414Z compiled=True, 2025-05-07T20:32:54.3074620Z ) 2025-05-07T20:32:54.3074992Z self = 2025-05-07T20:32:54.3075478Z T = 4096, D = 7168, scale_ub = None, contiguous = True, compiled = True 2025-05-07T20:32:54.3075752Z 2025-05-07T20:32:54.3075836Z @given( 2025-05-07T20:32:54.3076073Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:54.3076390Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:54.3076696Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:54.3077030Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:54.3077362Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:54.3077646Z ) 2025-05-07T20:32:54.3077998Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:54.3078442Z def test_silu_mul_quant( 2025-05-07T20:32:54.3078685Z self, 2025-05-07T20:32:54.3078885Z T: int, 2025-05-07T20:32:54.3079086Z D: int, 2025-05-07T20:32:54.3079301Z scale_ub: Optional[float], 2025-05-07T20:32:54.3079583Z contiguous: bool, 2025-05-07T20:32:54.3079828Z compiled: bool, 2025-05-07T20:32:54.3080061Z ) -> None: 2025-05-07T20:32:54.3080323Z torch.manual_seed(2025) 2025-05-07T20:32:54.3080576Z 2025-05-07T20:32:54.3080851Z > x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:54.3082893Z E torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 112.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 26.44 MiB is free. Including non-PyTorch memory, this process has 22.04 GiB memory in use. Of the allocated memory 21.73 GiB is allocated by PyTorch, and 19.12 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. 
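Note: 21.73 GiB of the 22.07 GiB card is live PyTorch allocation at this point, so the repeated OOMs are a symptom of earlier examples' tensors never being freed, not of fragmentation. One plausible cause (an assumption, not something this log confirms) is that the tracebacks Hypothesis retains for each failing example keep their frames, and therefore the CUDA tensors in them, reachable. A defensive cleanup under that assumption:

    import gc

    import torch

    def reclaim_cuda_memory() -> None:
        # Drop tensors that are only reachable through retained tracebacks...
        gc.collect()
        # ...then hand the now-unused cached blocks back to the driver.
        torch.cuda.empty_cache()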
See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables) 2025-05-07T20:32:54.3084708Z 2025-05-07T20:32:54.3084830Z moe/activation_test.py:92: OutOfMemoryError 2025-05-07T20:32:54.3085047Z 2025-05-07T20:32:54.3085153Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:54.3085611Z self=, 2025-05-07T20:32:54.3086010Z T=2048, 2025-05-07T20:32:54.3086202Z D=5120, 2025-05-07T20:32:54.3086402Z scale_ub=1200.0, 2025-05-07T20:32:54.3086636Z contiguous=False, 2025-05-07T20:32:54.3086860Z compiled=False, 2025-05-07T20:32:54.3624888Z ) 2025-05-07T20:32:54.3625252Z self = 2025-05-07T20:32:54.3625774Z T = 2048, D = 5120, scale_ub = 1200.0, contiguous = False, compiled = False 2025-05-07T20:32:54.3626053Z 2025-05-07T20:32:54.3626134Z @given( 2025-05-07T20:32:54.3626376Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:54.3626699Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:54.3627006Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:54.3627354Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:54.3627693Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:54.3627982Z ) 2025-05-07T20:32:54.3628537Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:54.3628985Z def test_silu_mul_quant( 2025-05-07T20:32:54.3629239Z self, 2025-05-07T20:32:54.3629437Z T: int, 2025-05-07T20:32:54.3629646Z D: int, 2025-05-07T20:32:54.3629879Z scale_ub: Optional[float], 2025-05-07T20:32:54.3630156Z contiguous: bool, 2025-05-07T20:32:54.3630406Z compiled: bool, 2025-05-07T20:32:54.3630640Z ) -> None: 2025-05-07T20:32:54.3630858Z torch.manual_seed(2025) 2025-05-07T20:32:54.3631104Z 2025-05-07T20:32:54.3631380Z > x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:54.3633383Z E torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 40.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 26.44 MiB is free. Including non-PyTorch memory, this process has 22.04 GiB memory in use. Of the allocated memory 21.73 GiB is allocated by PyTorch, and 19.12 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. 
See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables) 2025-05-07T20:32:54.3635286Z 2025-05-07T20:32:54.3635406Z moe/activation_test.py:92: OutOfMemoryError 2025-05-07T20:32:54.3635626Z 2025-05-07T20:32:54.3635732Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:54.3636147Z self=, 2025-05-07T20:32:54.3636555Z T=4096, 2025-05-07T20:32:54.3636742Z D=7168, 2025-05-07T20:32:54.3636942Z scale_ub=1200.0, 2025-05-07T20:32:54.3637169Z contiguous=True, 2025-05-07T20:32:54.3637393Z compiled=False, 2025-05-07T20:32:54.3637606Z ) 2025-05-07T20:32:54.3637929Z self = 2025-05-07T20:32:54.3638431Z T = 4096, D = 7168, scale_ub = 1200.0, contiguous = True, compiled = False 2025-05-07T20:32:54.3638713Z 2025-05-07T20:32:54.3638794Z @given( 2025-05-07T20:32:54.3639033Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:54.3639424Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:54.3639738Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:54.3640089Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:54.3640474Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:54.3640761Z ) 2025-05-07T20:32:54.3641117Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:54.3641566Z def test_silu_mul_quant( 2025-05-07T20:32:54.3641809Z self, 2025-05-07T20:32:54.3642050Z T: int, 2025-05-07T20:32:54.3642258Z D: int, 2025-05-07T20:32:54.3642485Z scale_ub: Optional[float], 2025-05-07T20:32:54.3642766Z contiguous: bool, 2025-05-07T20:32:54.3643103Z compiled: bool, 2025-05-07T20:32:54.3643340Z ) -> None: 2025-05-07T20:32:54.3643567Z torch.manual_seed(2025) 2025-05-07T20:32:54.3643809Z 2025-05-07T20:32:54.3644095Z > x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:54.3646103Z E torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 112.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 26.44 MiB is free. Including non-PyTorch memory, this process has 22.04 GiB memory in use. Of the allocated memory 21.73 GiB is allocated by PyTorch, and 19.12 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. 
See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables) 2025-05-07T20:32:54.3647913Z 2025-05-07T20:32:54.3648043Z moe/activation_test.py:92: OutOfMemoryError 2025-05-07T20:32:54.3648257Z 2025-05-07T20:32:54.3648369Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:54.3648821Z self=, 2025-05-07T20:32:54.3649229Z T=16384, 2025-05-07T20:32:54.3649430Z D=7168, 2025-05-07T20:32:54.3649622Z scale_ub=None, 2025-05-07T20:32:54.3649845Z contiguous=False, 2025-05-07T20:32:54.3650081Z compiled=True, 2025-05-07T20:32:54.3650283Z ) 2025-05-07T20:32:54.3650610Z self = 2025-05-07T20:32:54.3651111Z T = 16384, D = 7168, scale_ub = None, contiguous = False, compiled = True 2025-05-07T20:32:54.3651388Z 2025-05-07T20:32:54.3651469Z @given( 2025-05-07T20:32:54.3651706Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:54.3652025Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:54.3652392Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:54.3652728Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:54.3653068Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:54.3653362Z ) 2025-05-07T20:32:54.3653846Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:54.3654301Z def test_silu_mul_quant( 2025-05-07T20:32:54.3654550Z self, 2025-05-07T20:32:54.3654749Z T: int, 2025-05-07T20:32:54.3654957Z D: int, 2025-05-07T20:32:54.3655182Z scale_ub: Optional[float], 2025-05-07T20:32:54.3655456Z contiguous: bool, 2025-05-07T20:32:54.3655708Z compiled: bool, 2025-05-07T20:32:54.3655948Z ) -> None: 2025-05-07T20:32:54.3656172Z torch.manual_seed(2025) 2025-05-07T20:32:54.3656422Z 2025-05-07T20:32:54.3656693Z > x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:54.3658701Z E torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 448.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 26.44 MiB is free. Including non-PyTorch memory, this process has 22.04 GiB memory in use. Of the allocated memory 21.73 GiB is allocated by PyTorch, and 19.12 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. 
See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables) 2025-05-07T20:32:54.3660567Z 2025-05-07T20:32:54.3660687Z moe/activation_test.py:92: OutOfMemoryError 2025-05-07T20:32:54.3660910Z 2025-05-07T20:32:54.3661015Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:54.3661432Z self=, 2025-05-07T20:32:54.3661830Z T=4096, 2025-05-07T20:32:54.3662029Z D=7168, 2025-05-07T20:32:54.3662230Z scale_ub=None, 2025-05-07T20:32:54.3662447Z contiguous=True, 2025-05-07T20:32:54.3662682Z compiled=False, 2025-05-07T20:32:54.3662897Z ) 2025-05-07T20:32:54.3663261Z self = 2025-05-07T20:32:54.3663761Z T = 4096, D = 7168, scale_ub = None, contiguous = True, compiled = False 2025-05-07T20:32:54.3664035Z 2025-05-07T20:32:54.3664119Z @given( 2025-05-07T20:32:54.3664359Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:54.3664672Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:54.3664988Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:54.3665324Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:54.3665652Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:54.3665948Z ) 2025-05-07T20:32:54.3666303Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:54.3666739Z def test_silu_mul_quant( 2025-05-07T20:32:54.3666990Z self, 2025-05-07T20:32:54.3667194Z T: int, 2025-05-07T20:32:54.3667389Z D: int, 2025-05-07T20:32:54.3667617Z scale_ub: Optional[float], 2025-05-07T20:32:54.3667892Z contiguous: bool, 2025-05-07T20:32:54.3668179Z compiled: bool, 2025-05-07T20:32:54.3668411Z ) -> None: 2025-05-07T20:32:54.3668633Z torch.manual_seed(2025) 2025-05-07T20:32:54.3668884Z 2025-05-07T20:32:54.3669154Z > x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:54.3671154Z E torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 112.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 26.44 MiB is free. Including non-PyTorch memory, this process has 22.04 GiB memory in use. Of the allocated memory 21.73 GiB is allocated by PyTorch, and 19.12 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. 
See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables) 2025-05-07T20:32:54.3674023Z 2025-05-07T20:32:54.3674148Z moe/activation_test.py:92: OutOfMemoryError 2025-05-07T20:32:54.3674363Z 2025-05-07T20:32:54.3674475Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:54.3674884Z self=, 2025-05-07T20:32:54.3675294Z T=16384, 2025-05-07T20:32:54.3675496Z D=7168, 2025-05-07T20:32:54.3675697Z scale_ub=None, 2025-05-07T20:32:54.3675915Z contiguous=True, 2025-05-07T20:32:54.3676145Z compiled=False, 2025-05-07T20:32:54.3676356Z ) 2025-05-07T20:32:54.3676673Z self = 2025-05-07T20:32:54.3677170Z T = 16384, D = 7168, scale_ub = None, contiguous = True, compiled = False 2025-05-07T20:32:54.3677445Z 2025-05-07T20:32:54.3677533Z @given( 2025-05-07T20:32:54.3677764Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:54.3678085Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:54.3678400Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:54.3678734Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:54.3679072Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:54.3679364Z ) 2025-05-07T20:32:54.3679798Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:54.3680240Z def test_silu_mul_quant( 2025-05-07T20:32:54.3680486Z self, 2025-05-07T20:32:54.3680689Z T: int, 2025-05-07T20:32:54.3680887Z D: int, 2025-05-07T20:32:54.3681110Z scale_ub: Optional[float], 2025-05-07T20:32:54.3681391Z contiguous: bool, 2025-05-07T20:32:54.3681633Z compiled: bool, 2025-05-07T20:32:54.3681864Z ) -> None: 2025-05-07T20:32:54.3682085Z torch.manual_seed(2025) 2025-05-07T20:32:54.3682324Z 2025-05-07T20:32:54.3682599Z > x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:54.3684642Z E torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 448.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 26.44 MiB is free. Including non-PyTorch memory, this process has 22.04 GiB memory in use. Of the allocated memory 21.73 GiB is allocated by PyTorch, and 19.12 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. 
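Note: the error message's own suggestion, PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True, targets fragmentation, and here only 19.12 MiB is reserved-but-unallocated, so it is unlikely to rescue this run. For completeness, the setting must be in place before the caching allocator initializes, i.e. before the first CUDA allocation (illustrative sketch):

    import os

    # Must be set before the first tensor is placed on the GPU; the caching
    # allocator reads PYTORCH_CUDA_ALLOC_CONF once, at initialization.
    os.environ["PYTORCH_CUDA_ALLOC_CONF"] = "expandable_segments:True"

    import torch

    x = torch.randn(1, device="cuda")  # allocator now uses expandable segments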
See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables) 2025-05-07T20:32:54.3686458Z 2025-05-07T20:32:54.3686578Z moe/activation_test.py:92: OutOfMemoryError 2025-05-07T20:32:54.3686790Z 2025-05-07T20:32:54.3686898Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:54.3687306Z self=, 2025-05-07T20:32:54.3687712Z T=16384, 2025-05-07T20:32:54.3687913Z D=7168, 2025-05-07T20:32:54.3688107Z scale_ub=1200.0, 2025-05-07T20:32:54.3688334Z contiguous=True, 2025-05-07T20:32:54.3688563Z compiled=False, 2025-05-07T20:32:54.3688768Z ) 2025-05-07T20:32:54.3689133Z self = 2025-05-07T20:32:54.3689628Z T = 16384, D = 7168, scale_ub = 1200.0, contiguous = True, compiled = False 2025-05-07T20:32:54.3689905Z 2025-05-07T20:32:54.3689991Z @given( 2025-05-07T20:32:54.3690222Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:54.3690540Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:54.3690852Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:54.3691180Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:54.3691513Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:54.3691805Z ) 2025-05-07T20:32:54.3692151Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:54.3692642Z def test_silu_mul_quant( 2025-05-07T20:32:54.3692889Z self, 2025-05-07T20:32:54.3693086Z T: int, 2025-05-07T20:32:54.3693293Z D: int, 2025-05-07T20:32:54.3693519Z scale_ub: Optional[float], 2025-05-07T20:32:54.3693875Z contiguous: bool, 2025-05-07T20:32:54.3694119Z compiled: bool, 2025-05-07T20:32:54.3694351Z ) -> None: 2025-05-07T20:32:54.3694572Z torch.manual_seed(2025) 2025-05-07T20:32:54.3694812Z 2025-05-07T20:32:54.3695089Z > x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:54.3697090Z E torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 448.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 26.44 MiB is free. Including non-PyTorch memory, this process has 22.04 GiB memory in use. Of the allocated memory 21.73 GiB is allocated by PyTorch, and 19.12 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. 
See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables) 2025-05-07T20:32:54.3699150Z 2025-05-07T20:32:54.3699284Z moe/activation_test.py:92: OutOfMemoryError 2025-05-07T20:32:54.5509223Z 2025-05-07T20:32:54.5509558Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:54.5510383Z self=, 2025-05-07T20:32:54.5510806Z T=128, 2025-05-07T20:32:54.5511008Z D=5120, 2025-05-07T20:32:54.5511205Z scale_ub=1200.0, 2025-05-07T20:32:54.5511440Z contiguous=False, 2025-05-07T20:32:54.5511674Z compiled=False, 2025-05-07T20:32:54.5511885Z ) 2025-05-07T20:32:54.5512215Z self = 2025-05-07T20:32:54.5512720Z T = 128, D = 5120, scale_ub = 1200.0, contiguous = False, compiled = False 2025-05-07T20:32:54.5512998Z 2025-05-07T20:32:54.5513088Z @given( 2025-05-07T20:32:54.5513330Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:54.5513748Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:54.5514063Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:54.5514409Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:54.5514764Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:54.5515072Z ) 2025-05-07T20:32:54.5515461Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:54.5515900Z def test_silu_mul_quant( 2025-05-07T20:32:54.5516147Z self, 2025-05-07T20:32:54.5516347Z T: int, 2025-05-07T20:32:54.5516544Z D: int, 2025-05-07T20:32:54.5516770Z scale_ub: Optional[float], 2025-05-07T20:32:54.5517047Z contiguous: bool, 2025-05-07T20:32:54.5517291Z compiled: bool, 2025-05-07T20:32:54.5517523Z ) -> None: 2025-05-07T20:32:54.5517747Z torch.manual_seed(2025) 2025-05-07T20:32:54.5518001Z 2025-05-07T20:32:54.5518275Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:54.5518630Z 2025-05-07T20:32:54.5518831Z x_sign = torch.sign(x) 2025-05-07T20:32:54.5519216Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:54.5519540Z x = x_sign * x_clamp 2025-05-07T20:32:54.5519790Z x0 = x[:, :D] 2025-05-07T20:32:54.5520009Z x1 = x[:, D:] 2025-05-07T20:32:54.5520230Z 2025-05-07T20:32:54.5520426Z if contiguous: 2025-05-07T20:32:54.5520658Z x0 = x0.contiguous() 2025-05-07T20:32:54.5520926Z x1 = x1.contiguous() 2025-05-07T20:32:54.5521172Z 2025-05-07T20:32:54.5521365Z if scale_ub is not None: 2025-05-07T20:32:54.5521648Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:54.5521993Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:54.5522311Z ) 2025-05-07T20:32:54.5522610Z else: 2025-05-07T20:32:54.5522830Z scale_ub_tensor = None 2025-05-07T20:32:54.5523092Z 2025-05-07T20:32:54.5523328Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:54.5523654Z op = silu_mul_quant 2025-05-07T20:32:54.5523910Z if compiled: 2025-05-07T20:32:54.5524159Z op = torch.compile(op) 2025-05-07T20:32:54.5524466Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:54.5524746Z 2025-05-07T20:32:54.5524939Z > y_fp8, y_scale = fn() 2025-05-07T20:32:54.5525115Z 2025-05-07T20:32:54.5525215Z moe/activation_test.py:117: 2025-05-07T20:32:54.5525527Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:54.5525859Z moe/activation_test.py:115: in fn 2025-05-07T20:32:54.5526146Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:54.5526843Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:32:54.5527538Z 
_fbgemm_silu_mul_quant[grid]( 2025-05-07T20:32:54.5528075Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:54.5528763Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:54.5529429Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:54.5530014Z kernel = self.compile( 2025-05-07T20:32:54.5530556Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:54.5531214Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:54.5531617Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:54.5531846Z 2025-05-07T20:32:54.5532054Z self = 2025-05-07T20:32:54.5533168Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:54.5534700Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7fe5b9e951c0>} 2025-05-07T20:32:54.5536038Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:54.5537052Z context = 2025-05-07T20:32:54.5537341Z 2025-05-07T20:32:54.5537508Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:54.5538031Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:54.5538503Z module_map=module_map) 2025-05-07T20:32:54.5538876Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:54.5539274Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:32:54.5539559Z E ^ 2025-05-07T20:32:54.5540026Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:54.5540472Z 2025-05-07T20:32:54.5540888Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:54.5541403Z 2025-05-07T20:32:54.5541507Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:54.5541926Z self=, 2025-05-07T20:32:54.5542336Z T=2048, 2025-05-07T20:32:54.5542526Z D=7168, 2025-05-07T20:32:54.5542729Z scale_ub=None, 2025-05-07T20:32:54.5542958Z contiguous=False, 2025-05-07T20:32:54.5543263Z compiled=False, 2025-05-07T20:32:54.5543479Z ) 2025-05-07T20:32:54.5543812Z self = 2025-05-07T20:32:54.5544304Z T = 2048, D = 7168, scale_ub = None, contiguous = False, compiled = False 2025-05-07T20:32:54.5544587Z 2025-05-07T20:32:54.5544667Z @given( 2025-05-07T20:32:54.5544908Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:54.5545222Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:54.5545540Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:54.5554004Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:54.5554353Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:54.5554641Z ) 2025-05-07T20:32:54.5554996Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:54.5555444Z def test_silu_mul_quant( 2025-05-07T20:32:54.5555687Z self, 2025-05-07T20:32:54.5555896Z T: int, 2025-05-07T20:32:54.5556102Z D: int, 2025-05-07T20:32:54.5556322Z scale_ub: Optional[float], 2025-05-07T20:32:54.5556604Z contiguous: bool, 2025-05-07T20:32:54.5556856Z compiled: bool, 2025-05-07T20:32:54.5557082Z ) -> None: 2025-05-07T20:32:54.5557308Z torch.manual_seed(2025) 2025-05-07T20:32:54.5557559Z 2025-05-07T20:32:54.5557913Z > x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:54.5559936Z E torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 56.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 26.44 MiB is free. Including non-PyTorch memory, this process has 22.04 GiB memory in use. Of the allocated memory 21.74 GiB is allocated by PyTorch, and 10.99 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. 
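Note: the CompilationError above is independent of the OOMs: Triton gates the fp8e4nv (e4m3) encoding on GPU architecture, and this runner's GPU only offers fp8e4b15 and fp8e5. Assuming the cutoff is compute capability 8.9 (Ada/Hopper), a guard of the following shape would skip rather than fail on older cards (sketch; the class name is hypothetical):

    import unittest

    import torch

    def gpu_supports_fp8e4nv() -> bool:
        # Assumption: fp8e4nv (e4m3) requires compute capability >= 8.9.
        return (
            torch.cuda.is_available()
            and torch.cuda.get_device_capability() >= (8, 9)
        )

    @unittest.skipUnless(gpu_supports_fp8e4nv(), "fp8e4nv unsupported on this GPU")
    class Fp8ActivationTests(unittest.TestCase):
        ...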
See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables) 2025-05-07T20:32:54.5561757Z 2025-05-07T20:32:54.5561921Z moe/activation_test.py:92: OutOfMemoryError 2025-05-07T20:32:54.5562148Z 2025-05-07T20:32:54.5562255Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:54.5562671Z self=, 2025-05-07T20:32:54.5563072Z T=128, 2025-05-07T20:32:54.5563265Z D=7168, 2025-05-07T20:32:54.5563471Z scale_ub=1200.0, 2025-05-07T20:32:54.5563695Z contiguous=True, 2025-05-07T20:32:54.5563920Z compiled=True, 2025-05-07T20:32:54.5564137Z ) 2025-05-07T20:32:54.5564455Z self = 2025-05-07T20:32:54.5564945Z T = 128, D = 7168, scale_ub = 1200.0, contiguous = True, compiled = True 2025-05-07T20:32:54.5565220Z 2025-05-07T20:32:54.5565302Z @given( 2025-05-07T20:32:54.5565532Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:54.5565847Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:54.5566153Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:54.5566480Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:54.5566854Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:54.5567144Z ) 2025-05-07T20:32:54.5567484Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:54.5567927Z def test_silu_mul_quant( 2025-05-07T20:32:54.5568169Z self, 2025-05-07T20:32:54.5568366Z T: int, 2025-05-07T20:32:54.5568562Z D: int, 2025-05-07T20:32:54.5568783Z scale_ub: Optional[float], 2025-05-07T20:32:54.5569057Z contiguous: bool, 2025-05-07T20:32:54.5569290Z compiled: bool, 2025-05-07T20:32:54.5569515Z ) -> None: 2025-05-07T20:32:54.5569732Z torch.manual_seed(2025) 2025-05-07T20:32:54.5569968Z 2025-05-07T20:32:54.5570245Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:54.5570669Z 2025-05-07T20:32:54.5570861Z x_sign = torch.sign(x) 2025-05-07T20:32:54.5571158Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:54.5571473Z x = x_sign * x_clamp 2025-05-07T20:32:54.5571707Z x0 = x[:, :D] 2025-05-07T20:32:54.5571926Z x1 = x[:, D:] 2025-05-07T20:32:54.5572136Z 2025-05-07T20:32:54.5572319Z if contiguous: 2025-05-07T20:32:54.5572553Z x0 = x0.contiguous() 2025-05-07T20:32:54.5572813Z x1 = x1.contiguous() 2025-05-07T20:32:54.5573046Z 2025-05-07T20:32:54.5573242Z if scale_ub is not None: 2025-05-07T20:32:54.5573516Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:54.5573955Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:54.5574260Z ) 2025-05-07T20:32:54.5574453Z else: 2025-05-07T20:32:54.5574665Z scale_ub_tensor = None 2025-05-07T20:32:54.5574914Z 2025-05-07T20:32:54.5575150Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:54.5575467Z op = silu_mul_quant 2025-05-07T20:32:54.5575716Z if compiled: 2025-05-07T20:32:54.5575969Z op = torch.compile(op) 2025-05-07T20:32:54.5576268Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:54.5576539Z 2025-05-07T20:32:54.5576737Z > y_fp8, y_scale = fn() 2025-05-07T20:32:54.5576958Z 2025-05-07T20:32:54.5577064Z moe/activation_test.py:117: 2025-05-07T20:32:54.5577362Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:54.5577688Z moe/activation_test.py:115: in fn 2025-05-07T20:32:54.5577970Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:54.5578535Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/torch/_dynamo/eval_frame.py:678: in _fn 2025-05-07T20:32:54.5579089Z return fn(*args, **kwargs) 
2025-05-07T20:32:54.5579752Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:32:54.5580488Z _fbgemm_silu_mul_quant[grid]( 2025-05-07T20:32:54.5581027Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:54.5581697Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:54.5582365Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:54.5582897Z kernel = self.compile( 2025-05-07T20:32:54.5583432Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:54.5584086Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:54.5584486Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:54.5584712Z 2025-05-07T20:32:54.5584927Z self = 2025-05-07T20:32:54.5586033Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:54.5587387Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7fe5b9de7b00>} 2025-05-07T20:32:54.5588716Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:54.5589727Z context = 2025-05-07T20:32:54.5590010Z 2025-05-07T20:32:54.5590182Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:54.5590746Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:54.5591219Z module_map=module_map) 2025-05-07T20:32:54.5591589Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:54.5591941Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:32:54.5592206Z E ^ 2025-05-07T20:32:54.5592670Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:54.5593114Z 2025-05-07T20:32:54.5593532Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:54.8328937Z 2025-05-07T20:32:54.8329492Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:54.8330325Z self=, 2025-05-07T20:32:54.8330754Z T=128, 2025-05-07T20:32:54.8330951Z D=7168, 2025-05-07T20:32:54.8331173Z scale_ub=1200.0, 2025-05-07T20:32:54.8331403Z contiguous=True, 2025-05-07T20:32:54.8331630Z compiled=False, 2025-05-07T20:32:54.8331845Z ) 2025-05-07T20:32:54.8332174Z self = 2025-05-07T20:32:54.8332675Z T = 128, D = 7168, scale_ub = 1200.0, contiguous = True, compiled = False 2025-05-07T20:32:54.8333226Z 2025-05-07T20:32:54.8333313Z @given( 2025-05-07T20:32:54.8333542Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:54.8333973Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:54.8334283Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:54.8334608Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:54.8334937Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:54.8335228Z ) 2025-05-07T20:32:54.8335573Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:54.8336018Z def test_silu_mul_quant( 2025-05-07T20:32:54.8336264Z self, 2025-05-07T20:32:54.8336462Z T: int, 2025-05-07T20:32:54.8336657Z D: int, 2025-05-07T20:32:54.8336977Z scale_ub: Optional[float], 2025-05-07T20:32:54.8337257Z contiguous: bool, 2025-05-07T20:32:54.8337496Z compiled: bool, 2025-05-07T20:32:54.8337730Z ) -> None: 2025-05-07T20:32:54.8337958Z torch.manual_seed(2025) 2025-05-07T20:32:54.8338199Z 2025-05-07T20:32:54.8338477Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:54.8338826Z 2025-05-07T20:32:54.8339018Z x_sign = torch.sign(x) 2025-05-07T20:32:54.8339316Z > x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:54.8341368Z E torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 20.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 4.44 MiB is free. Including non-PyTorch memory, this process has 22.06 GiB memory in use. Of the allocated memory 21.77 GiB is allocated by PyTorch, and 6.37 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. 
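Note: the compiled=True variant above fails at the same spot as the eager one; torch.compile's eval_frame wrapper falls through to the identical _fbgemm_silu_mul_quant launch, so compilation mode does not change the outcome. A minimal illustration of that wrapping (silu_mul here is a stand-in, not the FBGEMM op):

    import torch

    def silu_mul(x0: torch.Tensor, x1: torch.Tensor) -> torch.Tensor:
        # Stand-in for the op under test: SiLU(x0) * x1.
        return x0 * torch.sigmoid(x0) * x1

    eager_op = silu_mul
    # For the FBGEMM op in this log, the compiled wrapper falls through to
    # the same Triton kernel, so it hits the same architecture check.
    compiled_op = torch.compile(silu_mul)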
See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables) 2025-05-07T20:32:54.8343204Z 2025-05-07T20:32:54.8343326Z moe/activation_test.py:95: OutOfMemoryError 2025-05-07T20:32:54.8343541Z 2025-05-07T20:32:54.8343653Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:54.8344062Z self=, 2025-05-07T20:32:54.8344469Z T=128, 2025-05-07T20:32:54.8344663Z D=5120, 2025-05-07T20:32:54.8344859Z scale_ub=1200.0, 2025-05-07T20:32:54.8345087Z contiguous=True, 2025-05-07T20:32:54.8345315Z compiled=True, 2025-05-07T20:32:54.8345519Z ) 2025-05-07T20:32:54.8345837Z self = 2025-05-07T20:32:54.8346398Z T = 128, D = 5120, scale_ub = 1200.0, contiguous = True, compiled = True 2025-05-07T20:32:54.8346660Z 2025-05-07T20:32:54.8346746Z @given( 2025-05-07T20:32:54.8346978Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:54.8347296Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:54.8347604Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:54.8347931Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:54.8348263Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:54.8348552Z ) 2025-05-07T20:32:54.8348895Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:54.8349338Z def test_silu_mul_quant( 2025-05-07T20:32:54.8349582Z self, 2025-05-07T20:32:54.8349776Z T: int, 2025-05-07T20:32:54.8349977Z D: int, 2025-05-07T20:32:54.8350201Z scale_ub: Optional[float], 2025-05-07T20:32:54.8350472Z contiguous: bool, 2025-05-07T20:32:54.8350718Z compiled: bool, 2025-05-07T20:32:54.8350945Z ) -> None: 2025-05-07T20:32:54.8351165Z torch.manual_seed(2025) 2025-05-07T20:32:54.8351407Z 2025-05-07T20:32:54.8351683Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:54.8352029Z 2025-05-07T20:32:54.8352223Z x_sign = torch.sign(x) 2025-05-07T20:32:54.8352518Z > x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:54.8354517Z E torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 20.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 4.44 MiB is free. Including non-PyTorch memory, this process has 22.06 GiB memory in use. Of the allocated memory 21.77 GiB is allocated by PyTorch, and 3.87 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. 
See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables) 2025-05-07T20:32:54.8356324Z 2025-05-07T20:32:54.8356452Z moe/activation_test.py:95: OutOfMemoryError 2025-05-07T20:32:54.8356664Z 2025-05-07T20:32:54.8356810Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:54.8357241Z self=, 2025-05-07T20:32:54.8357640Z T=128, 2025-05-07T20:32:54.8357839Z D=7168, 2025-05-07T20:32:54.8358038Z scale_ub=None, 2025-05-07T20:32:54.8358252Z contiguous=True, 2025-05-07T20:32:54.8358480Z compiled=True, 2025-05-07T20:32:54.8358688Z ) 2025-05-07T20:32:54.8359012Z self = 2025-05-07T20:32:54.8359492Z T = 128, D = 7168, scale_ub = None, contiguous = True, compiled = True 2025-05-07T20:32:54.8359852Z 2025-05-07T20:32:54.8359961Z @given( 2025-05-07T20:32:54.8360253Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:54.8360598Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:54.8360943Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:54.8361328Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:54.8361711Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:54.8362101Z ) 2025-05-07T20:32:54.8362495Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:54.8363013Z def test_silu_mul_quant( 2025-05-07T20:32:54.8363256Z self, 2025-05-07T20:32:54.8363498Z T: int, 2025-05-07T20:32:54.8363700Z D: int, 2025-05-07T20:32:54.8363959Z scale_ub: Optional[float], 2025-05-07T20:32:54.8364247Z contiguous: bool, 2025-05-07T20:32:54.8364531Z compiled: bool, 2025-05-07T20:32:54.8364791Z ) -> None: 2025-05-07T20:32:54.8365016Z torch.manual_seed(2025) 2025-05-07T20:32:54.8365274Z 2025-05-07T20:32:54.8365569Z > x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:54.8367892Z E torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 20.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 4.44 MiB is free. Including non-PyTorch memory, this process has 22.06 GiB memory in use. Of the allocated memory 21.77 GiB is allocated by PyTorch, and 3.87 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. 
See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables) 2025-05-07T20:32:54.8370098Z 2025-05-07T20:32:54.8370225Z moe/activation_test.py:92: OutOfMemoryError 2025-05-07T20:32:54.8370489Z 2025-05-07T20:32:54.8370910Z FAILED 2025-05-07T20:32:54.8371046Z 2025-05-07T20:32:54.8371182Z =================================== FAILURES =================================== 2025-05-07T20:32:54.8371649Z _____________________ ActivationTests.test_silu_mul_quant ______________________ 2025-05-07T20:32:54.8372143Z + Exception Group Traceback (most recent call last): 2025-05-07T20:32:54.8372821Z | File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.13/unittest/case.py", line 58, in testPartExecutor 2025-05-07T20:32:54.8373421Z | yield 2025-05-07T20:32:54.8374034Z | File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.13/unittest/case.py", line 651, in run 2025-05-07T20:32:54.8374686Z | self._callTestMethod(testMethod) 2025-05-07T20:32:54.8375029Z | ~~~~~~~~~~~~~~~~~~~~^^^^^^^^^^^^ 2025-05-07T20:32:54.8375607Z | File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.13/unittest/case.py", line 606, in _callTestMethod 2025-05-07T20:32:54.8376205Z | if method() is not None: 2025-05-07T20:32:54.8376507Z | ~~~~~~^^ 2025-05-07T20:32:54.8377178Z | File "/home/ec2-user/actions-runner/_work/FBGEMM/FBGEMM/fbgemm_gpu/experimental/gen_ai/test/moe/activation_test.py", line 75, in test_silu_mul_quant 2025-05-07T20:32:54.8378002Z | T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:54.8378351Z | ^^^^^^^ 2025-05-07T20:32:54.8379018Z | File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/hypothesis/core.py", line 1850, in wrapped_test 2025-05-07T20:32:54.8379710Z | raise the_error_hypothesis_found 2025-05-07T20:32:54.8380196Z | ExceptionGroup: Hypothesis found 4 distinct failures. (4 sub-exceptions) 2025-05-07T20:32:54.8380713Z +-+---------------- 1 ---------------- 2025-05-07T20:32:54.8381010Z | Traceback (most recent call last): 2025-05-07T20:32:54.8381872Z | File "/home/ec2-user/actions-runner/_work/FBGEMM/FBGEMM/fbgemm_gpu/experimental/gen_ai/test/moe/activation_test.py", line 92, in test_silu_mul_quant 2025-05-07T20:32:54.8383105Z | x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:54.8386813Z | torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 40.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 4.44 MiB is free. Including non-PyTorch memory, this process has 22.06 GiB memory in use. Of the allocated memory 21.77 GiB is allocated by PyTorch, and 3.87 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. 
See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables) 2025-05-07T20:32:54.8390190Z | Falsifying example: test_silu_mul_quant( 2025-05-07T20:32:54.8390869Z | self=, 2025-05-07T20:32:54.8391525Z | T=2048, 2025-05-07T20:32:54.8391919Z | D=5120, # or any other generated value 2025-05-07T20:32:54.8392425Z | scale_ub=None, # or any other generated value 2025-05-07T20:32:54.8392973Z | contiguous=True, # or any other generated value 2025-05-07T20:32:54.8393525Z | compiled=False, # or any other generated value 2025-05-07T20:32:54.8394070Z | ) 2025-05-07T20:32:54.8394319Z | 2025-05-07T20:32:54.8395114Z | You can reproduce this example by temporarily adding @reproduce_failure('6.131.14', b'AEECQQBBAEEAQQE=') as a decorator on your test case 2025-05-07T20:32:54.8395931Z +---------------- 2 ---------------- 2025-05-07T20:32:54.8396217Z | Traceback (most recent call last): 2025-05-07T20:32:54.8396923Z | File "/home/ec2-user/actions-runner/_work/FBGEMM/FBGEMM/fbgemm_gpu/experimental/gen_ai/test/moe/activation_test.py", line 92, in test_silu_mul_quant 2025-05-07T20:32:54.8397696Z | x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:54.8400003Z | torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 20.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 4.44 MiB is free. Including non-PyTorch memory, this process has 22.06 GiB memory in use. Of the allocated memory 21.77 GiB is allocated by PyTorch, and 3.87 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables) 2025-05-07T20:32:54.8401959Z | Falsifying example: test_silu_mul_quant( 2025-05-07T20:32:54.8402390Z | self=, 2025-05-07T20:32:54.8402925Z | T=128, 2025-05-07T20:32:54.8403133Z | D=7168, 2025-05-07T20:32:54.8403350Z | scale_ub=None, 2025-05-07T20:32:54.8403579Z | contiguous=True, 2025-05-07T20:32:54.8403820Z | compiled=True, 2025-05-07T20:32:54.8404041Z | ) 2025-05-07T20:32:54.8404212Z | 2025-05-07T20:32:54.8404732Z | You can reproduce this example by temporarily adding @reproduce_failure('6.131.14', b'AEEBQQFBAEEAQQA=') as a decorator on your test case 2025-05-07T20:32:54.8405335Z +---------------- 3 ---------------- 2025-05-07T20:32:54.8405619Z | Traceback (most recent call last): 2025-05-07T20:32:54.8406530Z | File "/home/ec2-user/actions-runner/_work/FBGEMM/FBGEMM/fbgemm_gpu/experimental/gen_ai/test/moe/activation_test.py", line 92, in test_silu_mul_quant 2025-05-07T20:32:54.8407319Z | x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:54.8409320Z | torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 20.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 4.44 MiB is free. Including non-PyTorch memory, this process has 22.06 GiB memory in use. Of the allocated memory 21.77 GiB is allocated by PyTorch, and 3.87 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. 
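Note: each "You can reproduce this example" line is directly usable; Hypothesis replays the encoded example instead of generating new ones. For failure 1, with the decorator arguments copied verbatim from the report (sketch; written as a free function rather than the original test method):

    from typing import Optional

    import hypothesis.strategies as st
    from hypothesis import Verbosity, given, reproduce_failure, settings

    @reproduce_failure('6.131.14', b'AEECQQBBAEEAQQE=')
    @given(
        T=st.sampled_from([1, 128, 2048, 4096, 16384]),
        D=st.sampled_from([5120, 7168]),
        scale_ub=st.sampled_from([None, 1200.00]),
        contiguous=st.sampled_from([True, False]),
        compiled=st.sampled_from([True, False]),
    )
    @settings(verbosity=Verbosity.verbose, deadline=None)
    def test_silu_mul_quant(
        T: int, D: int, scale_ub: Optional[float], contiguous: bool, compiled: bool
    ) -> None:
        ...  # body unchanged; remove @reproduce_failure once the failure is fixed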
See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables) 2025-05-07T20:32:54.8411300Z | Falsifying example: test_silu_mul_quant( 2025-05-07T20:32:54.8411732Z | self=, 2025-05-07T20:32:54.8412145Z | T=128, 2025-05-07T20:32:54.8412349Z | D=5120, 2025-05-07T20:32:54.8412552Z | scale_ub=1200.0, 2025-05-07T20:32:54.8412873Z | contiguous=True, 2025-05-07T20:32:54.8413118Z | compiled=True, 2025-05-07T20:32:54.8413336Z | ) 2025-05-07T20:32:54.8413521Z | 2025-05-07T20:32:54.8414168Z | You can reproduce this example by temporarily adding @reproduce_failure('6.131.14', b'AEEBQQBBAUEAQQA=') as a decorator on your test case 2025-05-07T20:32:54.8414764Z +---------------- 4 ---------------- 2025-05-07T20:32:54.8415058Z | Traceback (most recent call last): 2025-05-07T20:32:54.8415761Z | File "/home/ec2-user/actions-runner/_work/FBGEMM/FBGEMM/fbgemm_gpu/experimental/gen_ai/test/moe/activation_test.py", line 126, in test_silu_mul_quant 2025-05-07T20:32:54.8416467Z | y_fp8_ref, y_scale_ref = ref_fn() 2025-05-07T20:32:54.8416825Z | ~~~~~~^^ 2025-05-07T20:32:54.8417466Z | File "/home/ec2-user/actions-runner/_work/FBGEMM/FBGEMM/fbgemm_gpu/experimental/gen_ai/test/moe/activation_test.py", line 124, in ref_fn 2025-05-07T20:32:54.8418167Z | return triton_quantize_fp8_row(y, scale_ub_tensor) 2025-05-07T20:32:54.8418992Z | File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/fbgemm_gpu/experimental/gemm/triton_gemm/fp8_gemm.py", line 2370, in triton_quantize_fp8_row 2025-05-07T20:32:54.8419774Z | _kernel_quantize_fp8_row[grid]( 2025-05-07T20:32:54.8420060Z | ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~^ 2025-05-07T20:32:54.8420324Z | a, 2025-05-07T20:32:54.8420521Z | ^^ 2025-05-07T20:32:54.8420723Z | ...<23 lines>... 
2025-05-07T20:32:54.8420963Z | USE_INT64=use_int64, 2025-05-07T20:32:54.8421215Z | ^^^^^^^^^^^^^^^^^^^^ 2025-05-07T20:32:54.8421459Z | ) 2025-05-07T20:32:54.8421644Z | ^ 2025-05-07T20:32:54.8422161Z | File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/jit.py", line 330, in 2025-05-07T20:32:54.8422883Z | return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:54.8423326Z | ~~~~~~~~^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ 2025-05-07T20:32:54.8424024Z | File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/autotuner.py", line 186, in run 2025-05-07T20:32:54.8424785Z | timings = {config: self._bench(*args, config=config, **kwargs) for config in pruned_configs} 2025-05-07T20:32:54.8425250Z | ~~~~~~~~~~~^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ 2025-05-07T20:32:54.8425883Z | File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/autotuner.py", line 166, in _bench 2025-05-07T20:32:54.8426570Z | return self.do_bench(kernel_call, quantiles=(0.5, 0.2, 0.8)) 2025-05-07T20:32:54.8426972Z | ~~~~~~~~~~~~~^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ 2025-05-07T20:32:54.8427887Z | File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/testing.py", line 117, in do_bench 2025-05-07T20:32:54.8428687Z | fn() 2025-05-07T20:32:54.8428963Z | ~~^^ 2025-05-07T20:32:54.8429746Z | File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/autotuner.py", line 152, in kernel_call 2025-05-07T20:32:54.8430663Z | self.fn.run( 2025-05-07T20:32:54.8430970Z | ~~~~~~~~~~~^ 2025-05-07T20:32:54.8431265Z | *args, 2025-05-07T20:32:54.8431563Z | ^^^^^^ 2025-05-07T20:32:54.8431860Z | **current, 2025-05-07T20:32:54.8432170Z | ^^^^^^^^^^ 2025-05-07T20:32:54.8432480Z | ) 2025-05-07T20:32:54.8432741Z | ^ 2025-05-07T20:32:54.8433417Z | File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/jit.py", line 623, in run 2025-05-07T20:32:54.8434224Z | kernel = self.compile( 2025-05-07T20:32:54.8434591Z | src, 2025-05-07T20:32:54.8434905Z | target=target, 2025-05-07T20:32:54.8435350Z | options=options.__dict__, 2025-05-07T20:32:54.8435736Z | ) 2025-05-07T20:32:54.8436486Z | File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py", line 273, in compile 2025-05-07T20:32:54.8437469Z | module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:54.8438479Z | File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py", line 100, in make_ir 2025-05-07T20:32:54.8439566Z | return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:54.8440249Z | module_map=module_map) 2025-05-07T20:32:54.8440856Z | triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:54.8441343Z | def _kernel_quantize_fp8_row( 2025-05-07T20:32:54.8441714Z | ^ 2025-05-07T20:32:54.8442355Z | ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:54.8443150Z | Falsifying example: test_silu_mul_quant( 2025-05-07T20:32:54.8443735Z | # The test always failed when commented parts were varied together. 
2025-05-07T20:32:54.8444435Z | self=,
2025-05-07T20:32:54.8445041Z | T=1,  # or any other generated value
2025-05-07T20:32:54.8467552Z | D=5120,  # or any other generated value
2025-05-07T20:32:54.8468104Z | scale_ub=None,  # or any other generated value
2025-05-07T20:32:54.8468595Z | contiguous=True,  # or any other generated value
2025-05-07T20:32:54.8469095Z | compiled=True,  # or any other generated value
2025-05-07T20:32:54.8469516Z | )
2025-05-07T20:32:54.8469761Z |
2025-05-07T20:32:54.8470566Z | You can reproduce this example by temporarily adding @reproduce_failure('6.131.14', b'AEEAQQBBAEEAQQA=') as a decorator on your test case
2025-05-07T20:32:54.8471431Z +------------------------------------
2025-05-07T20:32:54.8471917Z ---------------------------------- Hypothesis ----------------------------------
2025-05-07T20:32:54.8472596Z Trying example: test_silu_mul_quant(
2025-05-07T20:32:54.8473147Z     self=,
2025-05-07T20:32:54.8473682Z     T=1,
2025-05-07T20:32:54.8473921Z     D=5120,
2025-05-07T20:32:54.8474177Z     scale_ub=None,
2025-05-07T20:32:54.8474464Z     contiguous=True,
2025-05-07T20:32:54.8474754Z     compiled=True,
2025-05-07T20:32:54.8475036Z )
2025-05-07T20:32:54.8475475Z self = 
2025-05-07T20:32:54.8476091Z T = 1, D = 5120, scale_ub = None, contiguous = True, compiled = True
2025-05-07T20:32:54.8476449Z 
2025-05-07T20:32:54.8476557Z     @given(
2025-05-07T20:32:54.8476951Z         T=st.sampled_from([1, 128, 2048, 4096, 16384]),
2025-05-07T20:32:54.8477377Z         D=st.sampled_from([5120, 7168]),
2025-05-07T20:32:54.8477805Z         scale_ub=st.sampled_from([None, 1200.00]),
2025-05-07T20:32:54.8478265Z         contiguous=st.sampled_from([True, False]),
2025-05-07T20:32:54.8478723Z         compiled=st.sampled_from([True, False]),
2025-05-07T20:32:54.8479114Z     )
2025-05-07T20:32:54.8479595Z     @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None)
2025-05-07T20:32:54.8480201Z     def test_silu_mul_quant(
2025-05-07T20:32:54.8480529Z         self,
2025-05-07T20:32:54.8480811Z         T: int,
2025-05-07T20:32:54.8481091Z         D: int,
2025-05-07T20:32:54.8481389Z         scale_ub: Optional[float],
2025-05-07T20:32:54.8481773Z         contiguous: bool,
2025-05-07T20:32:54.8482101Z         compiled: bool,
2025-05-07T20:32:54.8482391Z     ) -> None:
2025-05-07T20:32:54.8482680Z         torch.manual_seed(2025)
2025-05-07T20:32:54.8483010Z 
2025-05-07T20:32:54.8483372Z         x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16)
2025-05-07T20:32:54.8483919Z 
2025-05-07T20:32:54.8484175Z         x_sign = torch.sign(x)
2025-05-07T20:32:54.8484567Z         x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0)
2025-05-07T20:32:54.8484999Z         x = x_sign * x_clamp
2025-05-07T20:32:54.8485326Z         x0 = x[:, :D]
2025-05-07T20:32:54.8485630Z         x1 = x[:, D:]
2025-05-07T20:32:54.8485913Z 
2025-05-07T20:32:54.8486170Z         if contiguous:
2025-05-07T20:32:54.8486494Z             x0 = x0.contiguous()
2025-05-07T20:32:54.8486847Z             x1 = x1.contiguous()
2025-05-07T20:32:54.8487172Z 
2025-05-07T20:32:54.8487441Z         if scale_ub is not None:
2025-05-07T20:32:54.8487818Z             scale_ub_tensor = torch.tensor(
2025-05-07T20:32:54.8488267Z                 [scale_ub], device="cuda", dtype=torch.float32
2025-05-07T20:32:54.8488751Z             )
2025-05-07T20:32:54.8489018Z         else:
2025-05-07T20:32:54.8489301Z             scale_ub_tensor = None
2025-05-07T20:32:54.8489646Z 
2025-05-07T20:32:54.8489964Z         def fn() -> Tuple[torch.Tensor, torch.Tensor]:
2025-05-07T20:32:54.8490387Z             op = silu_mul_quant
2025-05-07T20:32:54.8490730Z             if compiled:
2025-05-07T20:32:54.8491070Z                 op = torch.compile(op)
2025-05-07T20:32:54.8491467Z             return op(x0, x1, scale_ub_tensor)
2025-05-07T20:32:54.8491844Z 
2025-05-07T20:32:54.8492109Z         y_fp8, y_scale = fn()
2025-05-07T20:32:54.8492490Z         y = y_fp8.to(torch.float32) * y_scale[:, None]
2025-05-07T20:32:54.8492884Z 
2025-05-07T20:32:54.8493206Z         def ref_fn() -> Tuple[torch.Tensor, torch.Tensor]:
2025-05-07T20:32:54.8493787Z             x0_fp32 = x0.to(torch.float32)
2025-05-07T20:32:54.8494183Z             x1_fp32 = x1.to(torch.float32)
2025-05-07T20:32:54.8494612Z             y = x0_fp32 * torch.sigmoid(x0_fp32) * x1_fp32
2025-05-07T20:32:54.8495106Z             return triton_quantize_fp8_row(y, scale_ub_tensor)
2025-05-07T20:32:54.8495522Z 
2025-05-07T20:32:54.8495804Z >       y_fp8_ref, y_scale_ref = ref_fn()
2025-05-07T20:32:54.8496078Z 
2025-05-07T20:32:54.8496223Z moe/activation_test.py:126: 
2025-05-07T20:32:54.8496629Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 
2025-05-07T20:32:54.8497141Z moe/activation_test.py:124: in ref_fn
2025-05-07T20:32:54.8497598Z     return triton_quantize_fp8_row(y, scale_ub_tensor)
2025-05-07T20:32:54.8499216Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/fbgemm_gpu/experimental/gemm/triton_gemm/fp8_gemm.py:2370: in triton_quantize_fp8_row
2025-05-07T20:32:54.8500255Z     _kernel_quantize_fp8_row[grid](
2025-05-07T20:32:54.8501059Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/jit.py:330: in
2025-05-07T20:32:54.8502012Z     return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs)
2025-05-07T20:32:54.8503207Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/autotuner.py:186: in run
2025-05-07T20:32:54.8504192Z     timings = {config: self._bench(*args, config=config, **kwargs) for config in pruned_configs}
2025-05-07T20:32:54.8505173Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/autotuner.py:166: in _bench
2025-05-07T20:32:54.8506044Z     return self.do_bench(kernel_call, quantiles=(0.5, 0.2, 0.8))
2025-05-07T20:32:54.8506854Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/testing.py:117: in do_bench
2025-05-07T20:32:54.8507558Z     fn()
2025-05-07T20:32:54.8508246Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/autotuner.py:152: in kernel_call
2025-05-07T20:32:54.8509039Z     self.fn.run(
2025-05-07T20:32:54.8509674Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/jit.py:623: in run
2025-05-07T20:32:54.8510398Z     kernel = self.compile(
2025-05-07T20:32:54.8511227Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:273: in compile
2025-05-07T20:32:54.8512112Z     module = src.make_ir(options, codegen_fns, module_map, context)
2025-05-07T20:32:54.8512651Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 
2025-05-07T20:32:54.8512980Z 
2025-05-07T20:32:54.8513268Z self = 
2025-05-07T20:32:54.8514801Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=2, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True)
2025-05-07T20:32:54.8516693Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7fe9c614a700>}
2025-05-07T20:32:54.8518613Z module_map = {'triton.language.extra.libdevice': }
2025-05-07T20:32:54.8519993Z context = 
2025-05-07T20:32:54.8520384Z 
2025-05-07T20:32:54.8520619Z     def make_ir(self, options, codegen_fns, module_map, context):
2025-05-07T20:32:54.8521349Z >       return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns,
2025-05-07T20:32:54.8521983Z                            module_map=module_map)
2025-05-07T20:32:54.8522475Z E       triton.compiler.errors.CompilationError: at 1:0:
2025-05-07T20:32:54.8522960Z E       def _kernel_quantize_fp8_row(
2025-05-07T20:32:54.8523327Z E       ^
2025-05-07T20:32:54.8523956Z E       ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')")
2025-05-07T20:32:54.8524573Z 
2025-05-07T20:32:54.8525159Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:100: CompilationError
2025-05-07T20:32:54.8525883Z 
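The reproduce hint in the falsifying-example box above can be applied directly. A minimal sketch, assuming Hypothesis 6.131.14 and the same decorator stack shown in the listing (st, _MAX_SAMPLES, and the test body come from the test module; this is not a new test):

    from hypothesis import given, reproduce_failure, settings, Verbosity
    import hypothesis.strategies as st

    @reproduce_failure('6.131.14', b'AEEAQQBBAEEAQQA=')  # blob copied verbatim from the log above
    @given(
        T=st.sampled_from([1, 128, 2048, 4096, 16384]),
        D=st.sampled_from([5120, 7168]),
        scale_ub=st.sampled_from([None, 1200.00]),
        contiguous=st.sampled_from([True, False]),
        compiled=st.sampled_from([True, False]),
    )
    @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None)
    def test_silu_mul_quant(self, T, D, scale_ub, contiguous, compiled) -> None:
        ...  # body unchanged; defined on the test class as in the listing above

The decorator is meant to be temporary: it replays exactly that one example, and Hypothesis raises DidNotReproduce once the replay no longer triggers the failure.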
2025-05-07T20:32:54.8526034Z Trying example: test_silu_mul_quant(
2025-05-07T20:32:54.8526609Z     self=,
2025-05-07T20:32:54.8527223Z     T=2048,
2025-05-07T20:32:54.8527468Z     D=5120,
2025-05-07T20:32:54.8527713Z     scale_ub=1200.0,
2025-05-07T20:32:54.8528005Z     contiguous=True,
2025-05-07T20:32:54.8528304Z     compiled=False,
2025-05-07T20:32:54.8528576Z )
2025-05-07T20:32:54.8528981Z self = 
2025-05-07T20:32:54.8529634Z T = 2048, D = 5120, scale_ub = 1200.0, contiguous = True, compiled = False
2025-05-07T20:32:54.8529986Z 
2025-05-07T20:32:54.8536967Z >       y_fp8, y_scale = fn()
2025-05-07T20:32:54.8536973Z 
2025-05-07T20:32:54.8537115Z moe/activation_test.py:117: 
2025-05-07T20:32:54.8537296Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 
2025-05-07T20:32:54.8537436Z moe/activation_test.py:115: in fn
2025-05-07T20:32:54.8537583Z     return op(x0, x1, scale_ub_tensor)
2025-05-07T20:32:54.8538267Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant
2025-05-07T20:32:54.8538411Z     _fbgemm_silu_mul_quant[grid](
2025-05-07T20:32:54.8538922Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/jit.py:330: in
2025-05-07T20:32:54.8539234Z     return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs)
2025-05-07T20:32:54.8539707Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/jit.py:623: in run
2025-05-07T20:32:54.8539898Z     kernel = self.compile(
2025-05-07T20:32:54.8540478Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:273: in compile
2025-05-07T20:32:54.8540726Z     module = src.make_ir(options, codegen_fns, module_map, context)
2025-05-07T20:32:54.8540905Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 
2025-05-07T20:32:54.8540912Z 
2025-05-07T20:32:54.8541207Z self = 
2025-05-07T20:32:54.8542327Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True)
2025-05-07T20:32:54.8543029Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7fe9c5ffa020>}
2025-05-07T20:32:54.8544049Z module_map = {'triton.language.extra.libdevice': }
2025-05-07T20:32:54.8544309Z context = 
2025-05-07T20:32:54.8544316Z 
2025-05-07T20:32:54.8544548Z     def make_ir(self, options, codegen_fns, module_map, context):
2025-05-07T20:32:54.8544904Z >       return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns,
2025-05-07T20:32:54.8545056Z                            module_map=module_map)
2025-05-07T20:32:54.8545280Z E       triton.compiler.errors.CompilationError: at 1:0:
2025-05-07T20:32:54.8545420Z E       def _fbgemm_silu_mul_quant(
2025-05-07T20:32:54.8545534Z E       ^
2025-05-07T20:32:54.8546071Z E       ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')")
2025-05-07T20:32:54.8546081Z 
2025-05-07T20:32:54.8546638Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:100: CompilationError
2025-05-07T20:32:54.8546644Z 
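Both failure modes above bottom out in the same ValueError: this GPU's Triton backend exposes only fp8e4b15 and fp8e5, while both _fbgemm_silu_mul_quant and _kernel_quantize_fp8_row request fp8e4nv (FP8 E4M3). Triton generally gates fp8e4nv on newer GPUs (compute capability 8.9 and up, Ada/Hopper); treat that exact threshold as an assumption here. A minimal guard sketch that would skip rather than fail such cases (the helper name and marker are hypothetical, not FBGEMM API):

    import pytest
    import torch

    def _supports_fp8e4nv() -> bool:
        # Assumed threshold: Triton's fp8e4nv requires compute capability >= 8.9.
        if not torch.cuda.is_available():
            return False
        return torch.cuda.get_device_capability() >= (8, 9)

    # Reusable marker for all FP8 row-quantization tests in this file.
    requires_fp8 = pytest.mark.skipif(
        not _supports_fp8e4nv(),
        reason="Triton fp8e4nv (E4M3) unavailable on this GPU",
    )

Applying requires_fp8 to test_silu_mul_quant would turn the repeated CompilationErrors below into skips on hardware without E4M3 support.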
2025-05-07T20:32:54.8546793Z Trying example: test_silu_mul_quant(
2025-05-07T20:32:54.8547070Z     self=,
2025-05-07T20:32:54.8547179Z     T=2048,
2025-05-07T20:32:54.8547283Z     D=5120,
2025-05-07T20:32:54.8547395Z     scale_ub=1200.0,
2025-05-07T20:32:54.8547512Z     contiguous=True,
2025-05-07T20:32:54.8547619Z     compiled=True,
2025-05-07T20:32:54.8547764Z )
2025-05-07T20:32:54.8555929Z >       y_fp8_ref, y_scale_ref = ref_fn()
2025-05-07T20:32:54.8556077Z moe/activation_test.py:126: 
2025-05-07T20:32:54.8567115Z E       triton.compiler.errors.CompilationError: at 1:0:
2025-05-07T20:32:54.8567251Z E       def _kernel_quantize_fp8_row(
2025-05-07T20:32:54.8567343Z E       ^
2025-05-07T20:32:54.8567823Z E       ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')")
2025-05-07T20:32:54.8568539Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:100: CompilationError
2025-05-07T20:32:54.8568549Z 
2025-05-07T20:32:54.8568690Z Trying example: test_silu_mul_quant(
2025-05-07T20:32:54.8569000Z     self=,
2025-05-07T20:32:54.8569106Z     T=16384,
2025-05-07T20:32:54.8569210Z     D=7168,
2025-05-07T20:32:54.8569334Z     scale_ub=1200.0,
2025-05-07T20:32:54.8569454Z     contiguous=False,
2025-05-07T20:32:54.8569570Z     compiled=False,
2025-05-07T20:32:54.8569680Z )
2025-05-07T20:32:54.8574967Z >       y_fp8, y_scale = fn()
2025-05-07T20:32:54.8575075Z moe/activation_test.py:117: 
2025-05-07T20:32:54.8580969Z E       triton.compiler.errors.CompilationError: at 1:0:
2025-05-07T20:32:54.8581066Z E       def _fbgemm_silu_mul_quant(
2025-05-07T20:32:54.8581181Z E       ^
2025-05-07T20:32:54.8581540Z E       ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')")
2025-05-07T20:32:54.8581951Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:100: CompilationError
2025-05-07T20:32:54.8581955Z 
2025-05-07T20:32:54.8582065Z Trying example: test_silu_mul_quant(
2025-05-07T20:32:54.8582283Z     self=,
2025-05-07T20:32:54.8582360Z     T=1,
2025-05-07T20:32:54.8582448Z     D=7168,
2025-05-07T20:32:54.8582528Z     scale_ub=None,
2025-05-07T20:32:54.8582611Z     contiguous=True,
2025-05-07T20:32:54.8582736Z     compiled=True,
2025-05-07T20:32:54.8582809Z )
2025-05-07T20:32:54.8588549Z >       y_fp8_ref, y_scale_ref = ref_fn()
2025-05-07T20:32:54.8588649Z moe/activation_test.py:126: 
2025-05-07T20:32:54.8596788Z E       triton.compiler.errors.CompilationError: at 1:0:
2025-05-07T20:32:54.8596894Z E       def _kernel_quantize_fp8_row(
2025-05-07T20:32:54.8596975Z E       ^
2025-05-07T20:32:54.8597331Z E       ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')")
2025-05-07T20:32:54.8597782Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:100: CompilationError
2025-05-07T20:32:54.8597787Z 
2025-05-07T20:32:54.8597896Z Trying example: test_silu_mul_quant(
2025-05-07T20:32:54.8598116Z     self=,
2025-05-07T20:32:54.8598501Z     T=4096,
2025-05-07T20:32:54.8598629Z     D=5120,
2025-05-07T20:32:54.8598715Z     scale_ub=None,
2025-05-07T20:32:54.8598802Z     contiguous=False,
2025-05-07T20:32:54.8598899Z     compiled=False,
2025-05-07T20:32:54.8598972Z )
2025-05-07T20:32:54.8603999Z >       y_fp8, y_scale = fn()
2025-05-07T20:32:54.8604096Z moe/activation_test.py:117: 
2025-05-07T20:32:54.8610006Z E       triton.compiler.errors.CompilationError: at 1:0:
2025-05-07T20:32:54.8610102Z E       def _fbgemm_silu_mul_quant(
2025-05-07T20:32:54.8610187Z E       ^
2025-05-07T20:32:54.8610583Z E       ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')")
2025-05-07T20:32:54.8611004Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:100: CompilationError
2025-05-07T20:32:54.8611009Z 
2025-05-07T20:32:54.8611109Z Trying example: test_silu_mul_quant(
2025-05-07T20:32:54.8611371Z     self=,
2025-05-07T20:32:54.8611458Z     T=4096,
2025-05-07T20:32:54.8611533Z     D=7168,
2025-05-07T20:32:54.8611627Z     scale_ub=None,
2025-05-07T20:32:54.8611711Z     contiguous=False,
2025-05-07T20:32:54.8611790Z     compiled=False,
2025-05-07T20:32:54.8611870Z )
2025-05-07T20:32:54.8616826Z >       y_fp8, y_scale = fn()
2025-05-07T20:32:54.8616970Z moe/activation_test.py:117: 
2025-05-07T20:32:54.8622768Z E       triton.compiler.errors.CompilationError: at 1:0:
2025-05-07T20:32:54.8622867Z E       def _fbgemm_silu_mul_quant(
2025-05-07T20:32:54.8622945Z E       ^
2025-05-07T20:32:54.8623298Z E       ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')")
2025-05-07T20:32:54.8623749Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:100: CompilationError
2025-05-07T20:32:54.8623754Z 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:54.8623305Z 2025-05-07T20:32:54.8623749Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:54.8623754Z 2025-05-07T20:32:54.8623868Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:54.8624088Z self=, 2025-05-07T20:32:54.8624164Z T=128, 2025-05-07T20:32:54.8624247Z D=7168, 2025-05-07T20:32:54.8624331Z scale_ub=None, 2025-05-07T20:32:54.8624415Z contiguous=False, 2025-05-07T20:32:54.8624508Z compiled=True, 2025-05-07T20:32:54.8624597Z ) 2025-05-07T20:32:54.8632976Z self = 2025-05-07T20:32:54.8633186Z T = 128, D = 7168, scale_ub = None, contiguous = False, compiled = True 2025-05-07T20:32:54.8633192Z 2025-05-07T20:32:54.8633275Z @given( 2025-05-07T20:32:54.8633411Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:54.8633513Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:54.8633635Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:54.8633862Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:54.8633982Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:54.8634062Z ) 2025-05-07T20:32:54.8634318Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:54.8634418Z def test_silu_mul_quant( 2025-05-07T20:32:54.8634497Z self, 2025-05-07T20:32:54.8634582Z T: int, 2025-05-07T20:32:54.8634658Z D: int, 2025-05-07T20:32:54.8634766Z scale_ub: Optional[float], 2025-05-07T20:32:54.8634855Z contiguous: bool, 2025-05-07T20:32:54.8634940Z compiled: bool, 2025-05-07T20:32:54.8635025Z ) -> None: 2025-05-07T20:32:54.8635120Z torch.manual_seed(2025) 2025-05-07T20:32:54.8635239Z 2025-05-07T20:32:54.8635415Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:54.8635488Z 2025-05-07T20:32:54.8635584Z x_sign = torch.sign(x) 2025-05-07T20:32:54.8635720Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:54.8635809Z x = x_sign * x_clamp 2025-05-07T20:32:54.8635890Z x0 = x[:, :D] 2025-05-07T20:32:54.8635979Z x1 = x[:, D:] 2025-05-07T20:32:54.8636054Z 2025-05-07T20:32:54.8636139Z if contiguous: 2025-05-07T20:32:54.8636239Z x0 = x0.contiguous() 2025-05-07T20:32:54.8636328Z x1 = x1.contiguous() 2025-05-07T20:32:54.8636409Z 2025-05-07T20:32:54.8636502Z if scale_ub is not None: 2025-05-07T20:32:54.8636610Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:54.8636756Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:54.8636833Z ) 2025-05-07T20:32:54.8636911Z else: 2025-05-07T20:32:54.8637021Z scale_ub_tensor = None 2025-05-07T20:32:54.8637097Z 2025-05-07T20:32:54.8637231Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:54.8637333Z op = silu_mul_quant 2025-05-07T20:32:54.8637422Z if compiled: 2025-05-07T20:32:54.8637524Z op = torch.compile(op) 2025-05-07T20:32:54.8637641Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:54.8637762Z 2025-05-07T20:32:54.8637863Z y_fp8, y_scale = fn() 2025-05-07T20:32:54.8637986Z y = y_fp8.to(torch.float32) * y_scale[:, None] 2025-05-07T20:32:54.8638062Z 2025-05-07T20:32:54.8638206Z def ref_fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:54.8638312Z x0_fp32 = x0.to(torch.float32) 2025-05-07T20:32:54.8638412Z x1_fp32 = x1.to(torch.float32) 2025-05-07T20:32:54.8638543Z y = x0_fp32 * torch.sigmoid(x0_fp32) * x1_fp32 2025-05-07T20:32:54.8638683Z return triton_quantize_fp8_row(y, scale_ub_tensor) 2025-05-07T20:32:54.8638761Z 2025-05-07T20:32:54.8638869Z > y_fp8_ref, 
y_scale_ref = ref_fn() 2025-05-07T20:32:54.8638873Z 2025-05-07T20:32:54.8639012Z moe/activation_test.py:126: 2025-05-07T20:32:54.8639157Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:54.8639264Z moe/activation_test.py:124: in ref_fn 2025-05-07T20:32:54.8639401Z return triton_quantize_fp8_row(y, scale_ub_tensor) 2025-05-07T20:32:54.8639963Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/fbgemm_gpu/experimental/gemm/triton_gemm/fp8_gemm.py:2370: in triton_quantize_fp8_row 2025-05-07T20:32:54.8640065Z _kernel_quantize_fp8_row[grid]( 2025-05-07T20:32:54.8640424Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:54.8640655Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:54.8641025Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/autotuner.py:186: in run 2025-05-07T20:32:54.8641293Z timings = {config: self._bench(*args, config=config, **kwargs) for config in pruned_configs} 2025-05-07T20:32:54.8641702Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/autotuner.py:166: in _bench 2025-05-07T20:32:54.8641873Z return self.do_bench(kernel_call, quantiles=(0.5, 0.2, 0.8)) 2025-05-07T20:32:54.8642219Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/testing.py:117: in do_bench 2025-05-07T20:32:54.8642300Z fn() 2025-05-07T20:32:54.8642703Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/autotuner.py:152: in kernel_call 2025-05-07T20:32:54.8642792Z self.fn.run( 2025-05-07T20:32:54.8643130Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:54.8643235Z kernel = self.compile( 2025-05-07T20:32:54.8643651Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:54.8643831Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:54.8643971Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:54.8643976Z 2025-05-07T20:32:54.8644182Z self = 2025-05-07T20:32:54.8644959Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=2, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:54.8645461Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7fe9c4683a60>} 2025-05-07T20:32:54.8646209Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:54.8646405Z context = 2025-05-07T20:32:54.8646410Z 2025-05-07T20:32:54.8646574Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:54.8646883Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:54.8646993Z module_map=module_map) 2025-05-07T20:32:54.8647155Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:54.8647265Z E def _kernel_quantize_fp8_row( 2025-05-07T20:32:54.8647345Z E ^ 2025-05-07T20:32:54.8647696Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:54.8647701Z 2025-05-07T20:32:54.8648108Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:54.8648113Z 2025-05-07T20:32:54.8648263Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:54.8648487Z self=, 2025-05-07T20:32:54.8648574Z T=128, 2025-05-07T20:32:54.8648654Z D=7168, 2025-05-07T20:32:54.8648742Z scale_ub=None, 2025-05-07T20:32:54.8648836Z contiguous=False, 2025-05-07T20:32:54.8648921Z compiled=False, 2025-05-07T20:32:54.8648999Z ) 2025-05-07T20:32:54.8649219Z self = 2025-05-07T20:32:54.8649391Z T = 128, D = 7168, scale_ub = None, contiguous = False, compiled = False 2025-05-07T20:32:54.8649395Z 2025-05-07T20:32:54.8649474Z @given( 2025-05-07T20:32:54.8649601Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:54.8649702Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:54.8649822Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:54.8649945Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:54.8650061Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:54.8650187Z ) 2025-05-07T20:32:54.8650432Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:54.8650529Z def test_silu_mul_quant( 2025-05-07T20:32:54.8650614Z self, 2025-05-07T20:32:54.8650693Z T: int, 2025-05-07T20:32:54.8650770Z D: int, 2025-05-07T20:32:54.8650873Z scale_ub: Optional[float], 2025-05-07T20:32:54.8650962Z contiguous: bool, 2025-05-07T20:32:54.8651047Z compiled: bool, 2025-05-07T20:32:54.8651134Z ) -> None: 2025-05-07T20:32:54.8651230Z torch.manual_seed(2025) 2025-05-07T20:32:54.8651303Z 2025-05-07T20:32:54.8651477Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:54.8651551Z 2025-05-07T20:32:54.8651693Z x_sign = torch.sign(x) 2025-05-07T20:32:54.8651817Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:54.8651909Z x = x_sign * x_clamp 2025-05-07T20:32:54.8651997Z x0 = x[:, :D] 2025-05-07T20:32:54.8652077Z x1 = x[:, D:] 2025-05-07T20:32:54.8652150Z 2025-05-07T20:32:54.8652238Z if contiguous: 2025-05-07T20:32:54.8652330Z x0 = x0.contiguous() 2025-05-07T20:32:54.8652418Z x1 = x1.contiguous() 2025-05-07T20:32:54.8652498Z 2025-05-07T20:32:54.8652590Z if scale_ub is not None: 2025-05-07T20:32:54.8652693Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:54.8652832Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:54.8652908Z ) 2025-05-07T20:32:54.8652990Z else: 2025-05-07T20:32:54.8653084Z scale_ub_tensor = None 2025-05-07T20:32:54.8653156Z 2025-05-07T20:32:54.8653288Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:54.8653381Z op = silu_mul_quant 2025-05-07T20:32:54.8653466Z if compiled: 2025-05-07T20:32:54.8653572Z op = torch.compile(op) 2025-05-07T20:32:54.8653804Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:54.8653883Z 2025-05-07T20:32:54.8653976Z > y_fp8, y_scale = fn() 2025-05-07T20:32:54.8653980Z 2025-05-07T20:32:54.8654127Z moe/activation_test.py:117: 2025-05-07T20:32:54.8654253Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:54.8654357Z moe/activation_test.py:115: in fn 2025-05-07T20:32:54.8654454Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:54.8654946Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:32:54.8655040Z 
_fbgemm_silu_mul_quant[grid]( 2025-05-07T20:32:54.8655394Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:54.8655619Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:54.8655998Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:54.8656097Z kernel = self.compile( 2025-05-07T20:32:54.8656474Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:54.8656647Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:54.8656779Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:54.8656783Z 2025-05-07T20:32:54.8656982Z self = 2025-05-07T20:32:54.8657748Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:54.8658286Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7fe9c4105e40>} 2025-05-07T20:32:54.8659014Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:54.8659209Z context = 2025-05-07T20:32:54.8659214Z 2025-05-07T20:32:54.8659373Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:54.8659631Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:54.8659740Z module_map=module_map) 2025-05-07T20:32:54.8659899Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:54.8660045Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:32:54.8660124Z E ^ 2025-05-07T20:32:54.8660522Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:54.8660532Z 2025-05-07T20:32:54.8660935Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:54.8660941Z 2025-05-07T20:32:54.8661044Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:54.8661264Z self=, 2025-05-07T20:32:54.8661341Z T=4096, 2025-05-07T20:32:54.8661417Z D=5120, 2025-05-07T20:32:54.8661508Z scale_ub=1200.0, 2025-05-07T20:32:54.8661589Z contiguous=True, 2025-05-07T20:32:54.8661671Z compiled=False, 2025-05-07T20:32:54.8661747Z ) 2025-05-07T20:32:54.8661962Z self = 2025-05-07T20:32:54.8662139Z T = 4096, D = 5120, scale_ub = 1200.0, contiguous = True, compiled = False 2025-05-07T20:32:54.8662144Z 2025-05-07T20:32:54.8662222Z @given( 2025-05-07T20:32:54.8662341Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:54.8662448Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:54.8662562Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:54.8662721Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:54.8662839Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:54.8662914Z ) 2025-05-07T20:32:54.8663152Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:54.8663250Z def test_silu_mul_quant( 2025-05-07T20:32:54.8663325Z self, 2025-05-07T20:32:54.8663406Z T: int, 2025-05-07T20:32:54.8663481Z D: int, 2025-05-07T20:32:54.8663576Z scale_ub: Optional[float], 2025-05-07T20:32:54.8663667Z contiguous: bool, 2025-05-07T20:32:54.8663754Z compiled: bool, 2025-05-07T20:32:54.8663829Z ) -> None: 2025-05-07T20:32:54.8663993Z torch.manual_seed(2025) 2025-05-07T20:32:54.8664065Z 2025-05-07T20:32:54.8664232Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:54.8664309Z 2025-05-07T20:32:54.8664399Z x_sign = torch.sign(x) 2025-05-07T20:32:54.8664524Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:54.8664615Z x = x_sign * x_clamp 2025-05-07T20:32:54.8664692Z x0 = x[:, :D] 2025-05-07T20:32:54.8664775Z x1 = x[:, D:] 2025-05-07T20:32:54.8664847Z 2025-05-07T20:32:54.8664929Z if contiguous: 2025-05-07T20:32:54.8665022Z x0 = x0.contiguous() 2025-05-07T20:32:54.8665109Z x1 = x1.contiguous() 2025-05-07T20:32:54.8665180Z 2025-05-07T20:32:54.8665276Z if scale_ub is not None: 2025-05-07T20:32:54.8665376Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:54.8665509Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:54.8665590Z ) 2025-05-07T20:32:54.8665665Z else: 2025-05-07T20:32:54.8665758Z scale_ub_tensor = None 2025-05-07T20:32:54.8665880Z 2025-05-07T20:32:54.8666007Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:54.8666095Z op = silu_mul_quant 2025-05-07T20:32:54.8666187Z if compiled: 2025-05-07T20:32:54.8666282Z op = torch.compile(op) 2025-05-07T20:32:54.8666389Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:54.8666458Z 2025-05-07T20:32:54.8666547Z > y_fp8, y_scale = fn() 2025-05-07T20:32:54.8666552Z 2025-05-07T20:32:54.8666653Z moe/activation_test.py:117: 2025-05-07T20:32:54.8666777Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:54.8666873Z moe/activation_test.py:115: in fn 2025-05-07T20:32:54.8666974Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:54.8667507Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:32:54.8667611Z 
_fbgemm_silu_mul_quant[grid]( 2025-05-07T20:32:54.8667963Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:54.8668181Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:54.8668519Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:54.8668612Z kernel = self.compile( 2025-05-07T20:32:54.8668990Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:54.8669166Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:54.8669288Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:54.8669295Z 2025-05-07T20:32:54.8669500Z self = 2025-05-07T20:32:54.8670264Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:54.8670804Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7fe9c41068e0>} 2025-05-07T20:32:54.8671537Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:54.8671724Z context = 2025-05-07T20:32:54.8671729Z 2025-05-07T20:32:54.8671899Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:54.8672193Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:54.8672310Z module_map=module_map) 2025-05-07T20:32:54.8672469Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:54.8672569Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:32:54.8672650Z E ^ 2025-05-07T20:32:54.8672994Z E ValueError("type fp8e4nv not supported in this architecture. 
/home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:100: CompilationError

Trying example: test_silu_mul_quant(
    self=,
    T=1,
    D=5120,
    scale_ub=None,
    contiguous=True,
    compiled=True,
)
self =
T = 1, D = 5120, scale_ub = None, contiguous = True, compiled = True

    @given(
        T=st.sampled_from([1, 128, 2048, 4096, 16384]),
        D=st.sampled_from([5120, 7168]),
        scale_ub=st.sampled_from([None, 1200.00]),
        contiguous=st.sampled_from([True, False]),
        compiled=st.sampled_from([True, False]),
    )
    @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None)
    def test_silu_mul_quant(
        self,
        T: int,
        D: int,
        scale_ub: Optional[float],
        contiguous: bool,
        compiled: bool,
    ) -> None:
        torch.manual_seed(2025)

        x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16)

        x_sign = torch.sign(x)
        x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0)
        x = x_sign * x_clamp
        x0 = x[:, :D]
        x1 = x[:, D:]

        if contiguous:
            x0 = x0.contiguous()
            x1 = x1.contiguous()

        if scale_ub is not None:
            scale_ub_tensor = torch.tensor(
                [scale_ub], device="cuda", dtype=torch.float32
            )
        else:
            scale_ub_tensor = None

        def fn() -> Tuple[torch.Tensor, torch.Tensor]:
            op = silu_mul_quant
            if compiled:
                op = torch.compile(op)
            return op(x0, x1, scale_ub_tensor)

        y_fp8, y_scale = fn()
        y = y_fp8.to(torch.float32) * y_scale[:, None]

        def ref_fn() -> Tuple[torch.Tensor, torch.Tensor]:
            x0_fp32 = x0.to(torch.float32)
            x1_fp32 = x1.to(torch.float32)
            y = x0_fp32 * torch.sigmoid(x0_fp32) * x1_fp32
            return triton_quantize_fp8_row(y, scale_ub_tensor)

>       y_fp8_ref, y_scale_ref = ref_fn()

moe/activation_test.py:126:
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _
moe/activation_test.py:124: in ref_fn
    return triton_quantize_fp8_row(y, scale_ub_tensor)
/home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/fbgemm_gpu/experimental/gemm/triton_gemm/fp8_gemm.py:2370: in triton_quantize_fp8_row
    _kernel_quantize_fp8_row[grid](
/home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/jit.py:330: in <lambda>
    return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs)
/home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/autotuner.py:186: in run
    timings = {config: self._bench(*args, config=config, **kwargs) for config in pruned_configs}
/home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/autotuner.py:166: in _bench
    return self.do_bench(kernel_call, quantiles=(0.5, 0.2, 0.8))
/home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/testing.py:117: in do_bench
    fn()
/home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/autotuner.py:152: in kernel_call
    self.fn.run(
/home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/jit.py:623: in run
    kernel = self.compile(
/home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:273: in compile
    module = src.make_ir(options, codegen_fns, module_map, context)
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _

self =
options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=2, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True)
codegen_fns = {'convert_custom_types': , 'min_dot_size': }
module_map = {'triton.language.extra.libdevice': }
context =

    def make_ir(self, options, codegen_fns, module_map, context):
>       return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns,
                           module_map=module_map)
E       triton.compiler.errors.CompilationError: at 1:0:
E       def _kernel_quantize_fp8_row(
E       ^
E       ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')")

/home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:100: CompilationError
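Every example Hypothesis draws below fails the same way, because the error is architecture-dependent rather than shape-dependent: Triton's fp8e4nv type (torch.float8_e4m3fn) has no codegen support on an A10G (SM 8.6); it requires SM 8.9+ such as Ada or Hopper. A guard along the following lines, checked before the test body runs, would skip cleanly on such GPUs; supports_fp8e4nv and the class name are hypothetical, a sketch rather than FBGEMM's actual gating:

    import unittest

    import torch

    def supports_fp8e4nv() -> bool:
        # Triton's fp8e4nv maps to torch.float8_e4m3fn, whose Triton codegen
        # needs SM 8.9+ (L4/Ada, H100/Hopper). An A10G is SM 8.6, so this
        # returns False there.
        if not torch.cuda.is_available():
            return False
        return torch.cuda.get_device_capability() >= (8, 9)

    @unittest.skipUnless(supports_fp8e4nv(), "fp8e4nv (float8_e4m3fn) requires SM 8.9+")
    class ActivationFp8Tests(unittest.TestCase):
        ...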
Trying example: test_silu_mul_quant(T=2048, D=5120, scale_ub=None, contiguous=True, compiled=True): same CompilationError in _kernel_quantize_fp8_row via ref_fn() at moe/activation_test.py:126.
Trying example: test_silu_mul_quant(T=128, D=5120, scale_ub=None, contiguous=True, compiled=True): same failure.
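The two fp8 formats the ValueError lists as supported map onto different front-end types: fp8e5 corresponds to torch.float8_e5m2, while fp8e4b15 is a Triton-internal bias-15 e4m3 variant with no torch dtype; fp8e4nv is torch.float8_e4m3fn. Reading the message that way suggests a capability-based dtype choice (a sketch; the mapping is our interpretation of the error, not FBGEMM policy):

    import torch

    def pick_fp8_dtype() -> torch.dtype:
        # fp8e4nv <-> torch.float8_e4m3fn : needs SM 8.9+ in Triton codegen.
        # fp8e5   <-> torch.float8_e5m2   : accepted on SM 8.6 per the error above.
        if torch.cuda.get_device_capability() >= (8, 9):
            return torch.float8_e4m3fn
        return torch.float8_e5m2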
Trying example: test_silu_mul_quant(T=4096, D=5120, scale_ub=None, contiguous=True, compiled=True): same failure.
Trying example: test_silu_mul_quant(T=16384, D=5120, scale_ub=None, contiguous=True, compiled=True): same failure.
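The reference path fails as well because triton_quantize_fp8_row itself launches the autotuned _kernel_quantize_fp8_row Triton kernel. A pure-PyTorch stand-in for the row-wise scheme implied by the test's dequantization step (y_fp8.to(torch.float32) * y_scale[:, None]) would sidestep Triton entirely; the helper name, the reading of scale_ub as a clamp on the row maximum, and the zero-row epsilon are our assumptions, not FBGEMM's implementation:

    from typing import Optional, Tuple

    import torch

    def quantize_fp8_row_torch(
        y: torch.Tensor, scale_ub: Optional[torch.Tensor] = None
    ) -> Tuple[torch.Tensor, torch.Tensor]:
        # Per-row max-abs, optionally clamped to the upper bound (assumption).
        row_max = y.abs().amax(dim=-1).to(torch.float32)
        if scale_ub is not None:
            row_max = torch.minimum(row_max, scale_ub.to(row_max.device))
        # Dequantization scale, chosen so y ~= y_fp8.to(torch.float32) * scale[:, None].
        scale = row_max / torch.finfo(torch.float8_e4m3fn).max  # e4m3fn max = 448.0
        scale = torch.clamp(scale, min=1e-12)  # guard all-zero rows
        y_fp8 = (y / scale[:, None]).to(torch.float8_e4m3fn)
        return y_fp8, scale

The final cast goes through PyTorch rather than Triton, so it should not hit the same fp8e4nv restriction on this GPU.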
Trying example: test_silu_mul_quant(
    self=,
    T=1,
    D=5120,
    scale_ub=1200.0,
    contiguous=True,
    compiled=True,
)
self =
T = 1, D = 5120, scale_ub = 1200.0, contiguous = True, compiled = True

Same test body as above; this draw fails earlier, at the compiled fn() call:

>       y_fp8, y_scale = fn()

moe/activation_test.py:117:
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _
moe/activation_test.py:115: in fn
    return op(x0, x1, scale_ub_tensor)
/home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/torch/_dynamo/eval_frame.py:678: in _fn
    return fn(*args, **kwargs)
/home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant
    _fbgemm_silu_mul_quant[grid](
/home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/jit.py:330: in <lambda>
    return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs)
/home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/jit.py:623: in run
    kernel = self.compile(
/home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:273: in compile
    module = src.make_ir(options, codegen_fns, module_map, context)
E       triton.compiler.errors.CompilationError: at 1:0:
E       def _fbgemm_silu_mul_quant(
E       ^
E       ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')")

/home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:100: CompilationError
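This draw is the first to die on the forward path itself rather than in the reference: the Dynamo frame (_dynamo/eval_frame.py) simply re-enters silu_mul_quant, which launches _fbgemm_silu_mul_quant. Stripped of the Hypothesis harness, a minimal reproducer would look like this (module path taken from the traceback above; passing None for the scale bound mirrors the test's scale_ub_tensor = None branch):

    import torch

    from fbgemm_gpu.experimental.gen_ai.moe.activation import silu_mul_quant

    x0 = torch.randn(1, 5120, device="cuda", dtype=torch.bfloat16)
    x1 = torch.randn(1, 5120, device="cuda", dtype=torch.bfloat16)

    # On SM < 8.9 this raises triton.compiler.errors.CompilationError:
    #   ValueError("type fp8e4nv not supported in this architecture. ...")
    y_fp8, y_scale = silu_mul_quant(x0, x1, None)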
Trying example: test_silu_mul_quant(T=1, D=5120, scale_ub=None, contiguous=False, compiled=True): same CompilationError in _kernel_quantize_fp8_row via ref_fn().
Trying example: test_silu_mul_quant(
    self=,
    T=1,
    D=5120,
    scale_ub=None,
    contiguous=True,
    compiled=False,
)
self =
T = 1, D = 5120, scale_ub = None, contiguous = True, compiled = False

Same test body; with compiled=False the eager call into silu_mul_quant fails at the same kernel launch, now without the Dynamo frame:

>       y_fp8, y_scale = fn()

moe/activation_test.py:117:
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _
moe/activation_test.py:115: in fn
    return op(x0, x1, scale_ub_tensor)
/home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant
    _fbgemm_silu_mul_quant[grid](
/home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/jit.py:330: in <lambda>
    return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs)
/home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/jit.py:623: in run
    kernel = self.compile(
/home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:273: in compile
    module = src.make_ir(options, codegen_fns, module_map, context)
E       triton.compiler.errors.CompilationError: at 1:0:
E       def _fbgemm_silu_mul_quant(
E       ^
E       ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')")

/home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:100: CompilationError
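The compiled=False traceback has no eval_frame hop, confirming the failure lives in the kernel launch itself rather than in anything torch.compile does. Combining ref_fn's math (verbatim from the listing above) with the torch-only quantizer sketched earlier gives a Triton-free reference for the whole op; silu_mul_quant_ref is a hypothetical name:

    from typing import Optional, Tuple

    import torch

    def silu_mul_quant_ref(
        x0: torch.Tensor, x1: torch.Tensor, scale_ub: Optional[torch.Tensor] = None
    ) -> Tuple[torch.Tensor, torch.Tensor]:
        # Same math as the test's ref_fn: SiLU(x0) * x1 in fp32, then row-wise fp8.
        x0_fp32 = x0.to(torch.float32)
        x1_fp32 = x1.to(torch.float32)
        y = x0_fp32 * torch.sigmoid(x0_fp32) * x1_fp32
        return quantize_fp8_row_torch(y, scale_ub)  # sketched after the ref_fn failures above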
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:54.8809124Z 2025-05-07T20:32:54.8809525Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:54.8809536Z 2025-05-07T20:32:54.8809634Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:54.8809851Z self=, 2025-05-07T20:32:54.8809932Z T=128, 2025-05-07T20:32:54.8810005Z D=5120, 2025-05-07T20:32:54.8810090Z scale_ub=None, 2025-05-07T20:32:54.8810181Z contiguous=False, 2025-05-07T20:32:54.8810261Z compiled=True, 2025-05-07T20:32:54.8810335Z ) 2025-05-07T20:32:54.8810594Z self = 2025-05-07T20:32:54.8810759Z T = 128, D = 5120, scale_ub = None, contiguous = False, compiled = True 2025-05-07T20:32:54.8810766Z 2025-05-07T20:32:54.8810846Z @given( 2025-05-07T20:32:54.8810964Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:54.8811062Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:54.8811179Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:54.8811291Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:54.8811400Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:54.8811478Z ) 2025-05-07T20:32:54.8811718Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:54.8811852Z def test_silu_mul_quant( 2025-05-07T20:32:54.8811936Z self, 2025-05-07T20:32:54.8812011Z T: int, 2025-05-07T20:32:54.8812085Z D: int, 2025-05-07T20:32:54.8812189Z scale_ub: Optional[float], 2025-05-07T20:32:54.8812277Z contiguous: bool, 2025-05-07T20:32:54.8812366Z compiled: bool, 2025-05-07T20:32:54.8812443Z ) -> None: 2025-05-07T20:32:54.8812540Z torch.manual_seed(2025) 2025-05-07T20:32:54.8812619Z 2025-05-07T20:32:54.8812785Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:54.8812857Z 2025-05-07T20:32:54.8812952Z x_sign = torch.sign(x) 2025-05-07T20:32:54.8813073Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:54.8813158Z x = x_sign * x_clamp 2025-05-07T20:32:54.8813244Z x0 = x[:, :D] 2025-05-07T20:32:54.8813320Z x1 = x[:, D:] 2025-05-07T20:32:54.8813393Z 2025-05-07T20:32:54.8813480Z if contiguous: 2025-05-07T20:32:54.8813573Z x0 = x0.contiguous() 2025-05-07T20:32:54.8813789Z x1 = x1.contiguous() 2025-05-07T20:32:54.8813871Z 2025-05-07T20:32:54.8813963Z if scale_ub is not None: 2025-05-07T20:32:54.8814078Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:54.8814211Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:54.8814286Z ) 2025-05-07T20:32:54.8814416Z else: 2025-05-07T20:32:54.8814508Z scale_ub_tensor = None 2025-05-07T20:32:54.8814580Z 2025-05-07T20:32:54.8814714Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:54.8814802Z op = silu_mul_quant 2025-05-07T20:32:54.8814885Z if compiled: 2025-05-07T20:32:54.8814988Z op = torch.compile(op) 2025-05-07T20:32:54.8815091Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:54.8815163Z 2025-05-07T20:32:54.8815259Z > y_fp8, y_scale = fn() 2025-05-07T20:32:54.8815263Z 2025-05-07T20:32:54.8815359Z moe/activation_test.py:117: 2025-05-07T20:32:54.8815491Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:54.8815631Z moe/activation_test.py:115: in fn 2025-05-07T20:32:54.8815731Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:54.8816096Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/torch/_dynamo/eval_frame.py:678: in _fn 2025-05-07T20:32:54.8816190Z return fn(*args, **kwargs) 
2025-05-07T20:32:54.8816673Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:32:54.8816779Z _fbgemm_silu_mul_quant[grid]( 2025-05-07T20:32:54.8817130Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:54.8817351Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:54.8817682Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:54.8817776Z kernel = self.compile( 2025-05-07T20:32:54.8818205Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:54.8818375Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:54.8818510Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:54.8818514Z 2025-05-07T20:32:54.8818712Z self = 2025-05-07T20:32:54.8819471Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:54.8819975Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7fe92ba12d40>} 2025-05-07T20:32:54.8820749Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:54.8820942Z context = 2025-05-07T20:32:54.8820949Z 2025-05-07T20:32:54.8821108Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:54.8821362Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:54.8821472Z module_map=module_map) 2025-05-07T20:32:54.8821629Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:54.8821733Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:32:54.8821808Z E ^ 2025-05-07T20:32:54.8822155Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:54.8822162Z 2025-05-07T20:32:54.8822574Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:54.8822578Z 2025-05-07T20:32:54.8822681Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:54.8822903Z self=, 2025-05-07T20:32:54.8823022Z T=128, 2025-05-07T20:32:54.8823095Z D=7168, 2025-05-07T20:32:54.8823182Z scale_ub=1200.0, 2025-05-07T20:32:54.8823264Z contiguous=False, 2025-05-07T20:32:54.8823346Z compiled=False, 2025-05-07T20:32:54.8823424Z ) 2025-05-07T20:32:54.8823637Z self = 2025-05-07T20:32:54.8823802Z T = 128, D = 7168, scale_ub = 1200.0, contiguous = False, compiled = False 2025-05-07T20:32:54.8823806Z 2025-05-07T20:32:54.8823885Z @given( 2025-05-07T20:32:54.8824010Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:54.8824113Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:54.8824266Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:54.8824382Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:54.8824497Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:54.8824571Z ) 2025-05-07T20:32:54.8824814Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:54.8824913Z def test_silu_mul_quant( 2025-05-07T20:32:54.8824988Z self, 2025-05-07T20:32:54.8825063Z T: int, 2025-05-07T20:32:54.8825143Z D: int, 2025-05-07T20:32:54.8825239Z scale_ub: Optional[float], 2025-05-07T20:32:54.8825326Z contiguous: bool, 2025-05-07T20:32:54.8825415Z compiled: bool, 2025-05-07T20:32:54.8825489Z ) -> None: 2025-05-07T20:32:54.8825589Z torch.manual_seed(2025) 2025-05-07T20:32:54.8825659Z 2025-05-07T20:32:54.8825826Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:54.8825905Z 2025-05-07T20:32:54.8825995Z x_sign = torch.sign(x) 2025-05-07T20:32:54.8826163Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:54.8826258Z x = x_sign * x_clamp 2025-05-07T20:32:54.8826337Z x0 = x[:, :D] 2025-05-07T20:32:54.8826415Z x1 = x[:, D:] 2025-05-07T20:32:54.8826494Z 2025-05-07T20:32:54.8826574Z if contiguous: 2025-05-07T20:32:54.8826661Z x0 = x0.contiguous() 2025-05-07T20:32:54.8826760Z x1 = x1.contiguous() 2025-05-07T20:32:54.8826831Z 2025-05-07T20:32:54.8826921Z if scale_ub is not None: 2025-05-07T20:32:54.8827028Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:54.8827185Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:54.8827261Z ) 2025-05-07T20:32:54.8827336Z else: 2025-05-07T20:32:54.8827477Z scale_ub_tensor = None 2025-05-07T20:32:54.8827549Z 2025-05-07T20:32:54.8827682Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:54.8827771Z op = silu_mul_quant 2025-05-07T20:32:54.8827855Z if compiled: 2025-05-07T20:32:54.8827960Z op = torch.compile(op) 2025-05-07T20:32:54.8828063Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:54.8828137Z 2025-05-07T20:32:54.8828235Z > y_fp8, y_scale = fn() 2025-05-07T20:32:54.8828239Z 2025-05-07T20:32:54.8828332Z moe/activation_test.py:117: 2025-05-07T20:32:54.8828458Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:54.8828562Z moe/activation_test.py:115: in fn 2025-05-07T20:32:54.8828658Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:54.8829152Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:32:54.8829250Z 
_fbgemm_silu_mul_quant[grid]( 2025-05-07T20:32:54.8829605Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/jit.py:330: in <lambda> 2025-05-07T20:32:54.8829833Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:54.8830166Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:54.8830324Z kernel = self.compile( 2025-05-07T20:32:54.8830709Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:54.8830878Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:54.8831011Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:54.8831016Z 2025-05-07T20:32:54.8831216Z self = 2025-05-07T20:32:54.8832015Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:54.8832523Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7fe92b34b880>} 2025-05-07T20:32:54.8833252Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:54.8833446Z context = 2025-05-07T20:32:54.8833451Z 2025-05-07T20:32:54.8833611Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:54.8833877Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:54.8833985Z module_map=module_map) 2025-05-07T20:32:54.8834143Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:54.8834248Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:32:54.8834367Z E ^ 2025-05-07T20:32:54.8834713Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:54.8834719Z 2025-05-07T20:32:54.8835128Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:54.8835132Z 
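Note: every failure in this run has the same root cause. Triton's fp8e4nv dtype is FP8 E4M3 in the NVIDIA encoding (torch.float8_e4m3fn), and Triton only compiles it for GPUs of compute capability 8.9 or newer (Ada/Hopper); on older parts it offers only 'fp8e4b15' and 'fp8e5', which is exactly what the ValueError reports. A minimal sketch of a capability guard that would skip these Hypothesis cases on such hardware follows; it assumes only the standard PyTorch CUDA API, and supports_fp8e4nv is a hypothetical helper name, not a function from the FBGEMM test suite:

    import unittest
    import torch

    def supports_fp8e4nv() -> bool:
        # Triton maps fp8e4nv to torch.float8_e4m3fn and, on NVIDIA GPUs,
        # needs compute capability (8, 9) or newer to compile it.
        if not torch.cuda.is_available():
            return False
        return torch.cuda.get_device_capability() >= (8, 9)

    # Applied to the failing test, e.g.:
    # @unittest.skipIf(not supports_fp8e4nv(), "fp8e4nv unsupported on this GPU")
    # def test_silu_mul_quant(self, ...) -> None: ...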
[… four further Hypothesis examples elided; each repeats the test body and traceback shown above verbatim (the compiled=True runs add a single torch/_dynamo/eval_frame.py:678 frame) and fails with the same CompilationError in _fbgemm_silu_mul_quant:
    T=128, D=5120, scale_ub=None,   contiguous=False, compiled=False
    T=128, D=5120, scale_ub=1200.0, contiguous=True,  compiled=False
    T=1,   D=7168, scale_ub=1200.0, contiguous=True,  compiled=True
    T=1,   D=7168, scale_ub=1200.0, contiguous=False, compiled=True …]
2025-05-07T20:32:54.8886853Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:54.8887111Z self=, 2025-05-07T20:32:54.8887191Z T=1, 2025-05-07T20:32:54.8887264Z D=7168, 2025-05-07T20:32:54.8887342Z scale_ub=None, 2025-05-07T20:32:54.8887432Z contiguous=False, 2025-05-07T20:32:54.8887514Z compiled=True, 2025-05-07T20:32:54.8887583Z ) 2025-05-07T20:32:54.8887800Z self = 2025-05-07T20:32:54.8887957Z T = 1, D = 7168, scale_ub = None, contiguous = False, compiled = True
[… test body identical to the examples above; this time fn() itself succeeded and the failure moved to the reference path …]
2025-05-07T20:32:54.8897200Z y_fp8, y_scale = fn() 2025-05-07T20:32:54.8897323Z y = y_fp8.to(torch.float32) * y_scale[:, None] 2025-05-07T20:32:54.8897406Z 2025-05-07T20:32:54.8897551Z def ref_fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:54.8897654Z x0_fp32 = x0.to(torch.float32) 2025-05-07T20:32:54.8897762Z x1_fp32 = x1.to(torch.float32) 2025-05-07T20:32:54.8897889Z y = x0_fp32 * torch.sigmoid(x0_fp32) * x1_fp32 2025-05-07T20:32:54.8898043Z return triton_quantize_fp8_row(y, scale_ub_tensor) 2025-05-07T20:32:54.8898120Z 2025-05-07T20:32:54.8898499Z > y_fp8_ref, 
y_scale_ref = ref_fn() 2025-05-07T20:32:54.8898508Z 2025-05-07T20:32:54.8898627Z moe/activation_test.py:126: 2025-05-07T20:32:54.8898759Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:54.8899030Z moe/activation_test.py:124: in ref_fn 2025-05-07T20:32:54.8899173Z return triton_quantize_fp8_row(y, scale_ub_tensor) 2025-05-07T20:32:54.8899721Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/fbgemm_gpu/experimental/gemm/triton_gemm/fp8_gemm.py:2370: in triton_quantize_fp8_row 2025-05-07T20:32:54.8899830Z _kernel_quantize_fp8_row[grid]( 2025-05-07T20:32:54.8900185Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:54.8900407Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:54.8900780Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/autotuner.py:186: in run 2025-05-07T20:32:54.8901100Z timings = {config: self._bench(*args, config=config, **kwargs) for config in pruned_configs} 2025-05-07T20:32:54.8901476Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/autotuner.py:166: in _bench 2025-05-07T20:32:54.8901643Z return self.do_bench(kernel_call, quantiles=(0.5, 0.2, 0.8)) 2025-05-07T20:32:54.8901975Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/testing.py:117: in do_bench 2025-05-07T20:32:54.8902056Z fn() 2025-05-07T20:32:54.8902451Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/autotuner.py:152: in kernel_call 2025-05-07T20:32:54.8902532Z self.fn.run( 2025-05-07T20:32:54.8902874Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:54.8902970Z kernel = self.compile( 2025-05-07T20:32:54.8903359Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:54.8903599Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:54.8903729Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:54.8903738Z 2025-05-07T20:32:54.8903953Z self = 2025-05-07T20:32:54.8904718Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=2, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:54.8905224Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7fe92bc00180>} 2025-05-07T20:32:54.8906021Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:54.8906212Z context = 2025-05-07T20:32:54.8906217Z 2025-05-07T20:32:54.8906387Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:54.8906647Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:54.8906764Z module_map=module_map) 2025-05-07T20:32:54.8906924Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:54.8907026Z E def _kernel_quantize_fp8_row( 2025-05-07T20:32:54.8907107Z E ^ 2025-05-07T20:32:54.8907456Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:54.8907464Z 2025-05-07T20:32:54.8907879Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:54.8907886Z 
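Note: in the example above the kernel under test ran, and it was the reference path that failed: ref_fn computes silu-mul in fp32 (y = x0 * sigmoid(x0) * x1) and then row-wise FP8 quantization via triton_quantize_fp8_row, which hits the same fp8e4nv compilation error in _kernel_quantize_fp8_row. For clarity, here is a plain-PyTorch sketch of what row-wise quantization computes, consistent with how the test dequantizes (y_fp8.to(torch.float32) * y_scale[:, None]). quantize_fp8_row_ref is a hypothetical name, and treating scale_ub as a cap on the per-row max is an assumption inferred from the test, not FBGEMM's documented contract:

    from typing import Optional, Tuple
    import torch

    def quantize_fp8_row_ref(
        y: torch.Tensor, scale_ub: Optional[torch.Tensor] = None
    ) -> Tuple[torch.Tensor, torch.Tensor]:
        # One scale per row: the row's max |value| maps to the largest
        # finite float8_e4m3fn value (448.0).
        fp8_max = torch.finfo(torch.float8_e4m3fn).max
        row_max = y.abs().amax(dim=-1).to(torch.float32)
        if scale_ub is not None:
            # Assumed semantics: cap the row max before computing the scale.
            row_max = torch.minimum(row_max, scale_ub)
        scale = torch.clamp(row_max, min=1e-12) / fp8_max
        y_fp8 = (y.to(torch.float32) / scale[:, None]).to(torch.float8_e4m3fn)
        return y_fp8, scale

    # Dequantization then recovers y approximately:
    #   y ≈ y_fp8.to(torch.float32) * scale[:, None]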
2025-05-07T20:32:54.8914881Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:32:54.8914979Z _fbgemm_silu_mul_quant[grid]( 2025-05-07T20:32:54.8915339Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:54.8915557Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:54.8915933Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:54.8916030Z kernel = self.compile( 2025-05-07T20:32:54.8916404Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:54.8916573Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:54.8916704Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:54.8916709Z 2025-05-07T20:32:54.8916908Z self = 2025-05-07T20:32:54.8917720Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:54.8918220Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7fe92bc01300>} 2025-05-07T20:32:54.8918955Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:54.8919144Z context = 2025-05-07T20:32:54.8919148Z 2025-05-07T20:32:54.8919307Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:54.8919569Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:54.8919676Z module_map=module_map) 2025-05-07T20:32:54.8919874Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:54.8919979Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:32:54.8920055Z E ^ 2025-05-07T20:32:54.8920411Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:54.8920415Z 2025-05-07T20:32:54.8920818Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:54.8920822Z 2025-05-07T20:32:54.8920921Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:54.8921143Z self=, 2025-05-07T20:32:54.8921219Z T=1, 2025-05-07T20:32:54.8921301Z D=5120, 2025-05-07T20:32:54.8921424Z scale_ub=1200.0, 2025-05-07T20:32:54.8921509Z contiguous=False, 2025-05-07T20:32:54.8921595Z compiled=False, 2025-05-07T20:32:54.8921668Z ) 2025-05-07T20:32:54.8921886Z self = 2025-05-07T20:32:54.8922054Z T = 1, D = 5120, scale_ub = 1200.0, contiguous = False, compiled = False 2025-05-07T20:32:54.8922059Z 2025-05-07T20:32:54.8922137Z @given( 2025-05-07T20:32:54.8922255Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:54.8922356Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:54.8922466Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:54.8922591Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:54.8922702Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:54.8922777Z ) 2025-05-07T20:32:54.8923026Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:54.8923120Z def test_silu_mul_quant( 2025-05-07T20:32:54.8923197Z self, 2025-05-07T20:32:54.8923279Z T: int, 2025-05-07T20:32:54.8923354Z D: int, 2025-05-07T20:32:54.8923452Z scale_ub: Optional[float], 2025-05-07T20:32:54.8923550Z contiguous: bool, 2025-05-07T20:32:54.8923633Z compiled: bool, 2025-05-07T20:32:54.8923709Z ) -> None: 2025-05-07T20:32:54.8923811Z torch.manual_seed(2025) 2025-05-07T20:32:54.8923926Z 2025-05-07T20:32:54.8924099Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:54.8924171Z 2025-05-07T20:32:54.8924261Z x_sign = torch.sign(x) 2025-05-07T20:32:54.8924390Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:54.8924479Z x = x_sign * x_clamp 2025-05-07T20:32:54.8924555Z x0 = x[:, :D] 2025-05-07T20:32:54.8924641Z x1 = x[:, D:] 2025-05-07T20:32:54.8924714Z 2025-05-07T20:32:54.8924796Z if contiguous: 2025-05-07T20:32:54.8924893Z x0 = x0.contiguous() 2025-05-07T20:32:54.8924984Z x1 = x1.contiguous() 2025-05-07T20:32:54.8925054Z 2025-05-07T20:32:54.8925219Z if scale_ub is not None: 2025-05-07T20:32:54.8925324Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:54.8925456Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:54.8925536Z ) 2025-05-07T20:32:54.8925618Z else: 2025-05-07T20:32:54.8925715Z scale_ub_tensor = None 2025-05-07T20:32:54.8925787Z 2025-05-07T20:32:54.8925913Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:54.8926007Z op = silu_mul_quant 2025-05-07T20:32:54.8926088Z if compiled: 2025-05-07T20:32:54.8926184Z op = torch.compile(op) 2025-05-07T20:32:54.8926292Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:54.8926360Z 2025-05-07T20:32:54.8926451Z > y_fp8, y_scale = fn() 2025-05-07T20:32:54.8926455Z 2025-05-07T20:32:54.8926554Z moe/activation_test.py:117: 2025-05-07T20:32:54.8926687Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:54.8926790Z moe/activation_test.py:115: in fn 2025-05-07T20:32:54.8926931Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:54.8927420Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:32:54.8927521Z 
_fbgemm_silu_mul_quant[grid]( 2025-05-07T20:32:54.8927874Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:54.8928093Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:54.8928432Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:54.8928522Z kernel = self.compile( 2025-05-07T20:32:54.8928904Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:54.8929115Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:54.8929247Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:54.8929251Z 2025-05-07T20:32:54.8929457Z self = 2025-05-07T20:32:54.8930218Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:54.8930722Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7fe92bc02020>} 2025-05-07T20:32:54.8931453Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:54.8931643Z context = 2025-05-07T20:32:54.8931654Z 2025-05-07T20:32:54.8931818Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:54.8932072Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:54.8932228Z module_map=module_map) 2025-05-07T20:32:54.8932388Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:54.8932484Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:32:54.8932567Z E ^ 2025-05-07T20:32:54.8932912Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:54.8932916Z 2025-05-07T20:32:54.8933323Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:54.8933331Z 2025-05-07T20:32:54.8933428Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:54.8933789Z self=, 2025-05-07T20:32:54.8933878Z T=16384, 2025-05-07T20:32:54.8933954Z D=5120, 2025-05-07T20:32:54.8934035Z scale_ub=1200.0, 2025-05-07T20:32:54.8934127Z contiguous=False, 2025-05-07T20:32:54.8934210Z compiled=True, 2025-05-07T20:32:54.8934281Z ) 2025-05-07T20:32:54.8934506Z self = 2025-05-07T20:32:54.8934683Z T = 16384, D = 5120, scale_ub = 1200.0, contiguous = False, compiled = True 2025-05-07T20:32:54.8934687Z 2025-05-07T20:32:54.8934770Z @given( 2025-05-07T20:32:54.8934886Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:54.8934981Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:54.8935099Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:54.8935216Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:54.8935328Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:54.8935410Z ) 2025-05-07T20:32:54.8935690Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:54.8935787Z def test_silu_mul_quant( 2025-05-07T20:32:54.8935862Z self, 2025-05-07T20:32:54.8935939Z T: int, 2025-05-07T20:32:54.8936017Z D: int, 2025-05-07T20:32:54.8936112Z scale_ub: Optional[float], 2025-05-07T20:32:54.8936198Z contiguous: bool, 2025-05-07T20:32:54.8936286Z compiled: bool, 2025-05-07T20:32:54.8936366Z ) -> None: 2025-05-07T20:32:54.8936459Z torch.manual_seed(2025) 2025-05-07T20:32:54.8936539Z 2025-05-07T20:32:54.8936702Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:54.8936773Z 2025-05-07T20:32:54.8936876Z x_sign = torch.sign(x) 2025-05-07T20:32:54.8937040Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:54.8937127Z x = x_sign * x_clamp 2025-05-07T20:32:54.8937214Z x0 = x[:, :D] 2025-05-07T20:32:54.8937292Z x1 = x[:, D:] 2025-05-07T20:32:54.8937375Z 2025-05-07T20:32:54.8937458Z if contiguous: 2025-05-07T20:32:54.8937546Z x0 = x0.contiguous() 2025-05-07T20:32:54.8937641Z x1 = x1.contiguous() 2025-05-07T20:32:54.8937712Z 2025-05-07T20:32:54.8937800Z if scale_ub is not None: 2025-05-07T20:32:54.8937910Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:54.8938040Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:54.8938115Z ) 2025-05-07T20:32:54.8938199Z else: 2025-05-07T20:32:54.8938291Z scale_ub_tensor = None 2025-05-07T20:32:54.8938363Z 2025-05-07T20:32:54.8938499Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:54.8938585Z op = silu_mul_quant 2025-05-07T20:32:54.8938676Z if compiled: 2025-05-07T20:32:54.8938773Z op = torch.compile(op) 2025-05-07T20:32:54.8938877Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:54.8938954Z 2025-05-07T20:32:54.8939046Z > y_fp8, y_scale = fn() 2025-05-07T20:32:54.8939050Z 2025-05-07T20:32:54.8939144Z moe/activation_test.py:117: 2025-05-07T20:32:54.8939275Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:54.8939425Z moe/activation_test.py:115: in fn 2025-05-07T20:32:54.8939522Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:54.8939890Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/torch/_dynamo/eval_frame.py:678: in _fn 2025-05-07T20:32:54.8939980Z return fn(*args, **kwargs) 
2025-05-07T20:32:54.8940469Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:32:54.8940562Z _fbgemm_silu_mul_quant[grid]( 2025-05-07T20:32:54.8940917Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:54.8941180Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:54.8941512Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:54.8941607Z kernel = self.compile( 2025-05-07T20:32:54.8941990Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:54.8942160Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:54.8942290Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:54.8942295Z 2025-05-07T20:32:54.8942495Z self = 2025-05-07T20:32:54.8943256Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:54.8943800Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7fe92bc03600>} 2025-05-07T20:32:54.8944534Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:54.8944727Z context = 2025-05-07T20:32:54.8944732Z 2025-05-07T20:32:54.8944893Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:54.8945153Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:54.8945299Z module_map=module_map) 2025-05-07T20:32:54.8945457Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:54.8945562Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:32:54.8945639Z E ^ 2025-05-07T20:32:54.8945984Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:54.8945992Z 2025-05-07T20:32:54.8946402Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:54.8946406Z 2025-05-07T20:32:54.8946507Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:54.8946728Z self=, 2025-05-07T20:32:54.8946804Z T=2048, 2025-05-07T20:32:54.8946883Z D=7168, 2025-05-07T20:32:54.8946972Z scale_ub=1200.0, 2025-05-07T20:32:54.8947055Z contiguous=False, 2025-05-07T20:32:54.8947135Z compiled=True, 2025-05-07T20:32:54.8947218Z ) 2025-05-07T20:32:54.8947434Z self = 2025-05-07T20:32:54.8947605Z T = 2048, D = 7168, scale_ub = 1200.0, contiguous = False, compiled = True 2025-05-07T20:32:54.8947622Z 2025-05-07T20:32:54.8947697Z @given( 2025-05-07T20:32:54.8947813Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:54.8947967Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:54.8948077Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:54.8948193Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:54.8948311Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:54.8948385Z ) 2025-05-07T20:32:54.8948625Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:54.8948723Z def test_silu_mul_quant( 2025-05-07T20:32:54.8948800Z self, 2025-05-07T20:32:54.8948877Z T: int, 2025-05-07T20:32:54.8948963Z D: int, 2025-05-07T20:32:54.8949060Z scale_ub: Optional[float], 2025-05-07T20:32:54.8949153Z contiguous: bool, 2025-05-07T20:32:54.8949276Z compiled: bool, 2025-05-07T20:32:54.8949354Z ) -> None: 2025-05-07T20:32:54.8949457Z torch.manual_seed(2025) 2025-05-07T20:32:54.8949528Z 2025-05-07T20:32:54.8949690Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:54.8949774Z 2025-05-07T20:32:54.8949862Z x_sign = torch.sign(x) 2025-05-07T20:32:54.8949983Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:54.8950082Z x = x_sign * x_clamp 2025-05-07T20:32:54.8950160Z x0 = x[:, :D] 2025-05-07T20:32:54.8950238Z x1 = x[:, D:] 2025-05-07T20:32:54.8950316Z 2025-05-07T20:32:54.8950396Z if contiguous: 2025-05-07T20:32:54.8950489Z x0 = x0.contiguous() 2025-05-07T20:32:54.8950576Z x1 = x1.contiguous() 2025-05-07T20:32:54.8950645Z 2025-05-07T20:32:54.8950742Z if scale_ub is not None: 2025-05-07T20:32:54.8950841Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:54.8950997Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:54.8951110Z ) 2025-05-07T20:32:54.8951191Z else: 2025-05-07T20:32:54.8951283Z scale_ub_tensor = None 2025-05-07T20:32:54.8951356Z 2025-05-07T20:32:54.8951493Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:54.8951581Z op = silu_mul_quant 2025-05-07T20:32:54.8951661Z if compiled: 2025-05-07T20:32:54.8951769Z op = torch.compile(op) 2025-05-07T20:32:54.8951872Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:54.8951942Z 2025-05-07T20:32:54.8952039Z > y_fp8, y_scale = fn() 2025-05-07T20:32:54.8952044Z 2025-05-07T20:32:54.8952142Z moe/activation_test.py:117: 2025-05-07T20:32:54.8952277Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:54.8952414Z moe/activation_test.py:115: in fn 2025-05-07T20:32:54.8952511Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:54.8952883Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/torch/_dynamo/eval_frame.py:678: in _fn 2025-05-07T20:32:54.8952974Z return fn(*args, **kwargs) 
2025-05-07T20:32:54.8953458Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant
2025-05-07T20:32:54.8953561Z     _fbgemm_silu_mul_quant[grid](
2025-05-07T20:32:54.8953911Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/jit.py:330: in <lambda>
2025-05-07T20:32:54.8954136Z     return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs)
2025-05-07T20:32:54.8954471Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/jit.py:623: in run
2025-05-07T20:32:54.8954561Z     kernel = self.compile(
2025-05-07T20:32:54.8954946Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:273: in compile
2025-05-07T20:32:54.8955119Z     module = src.make_ir(options, codegen_fns, module_map, context)
2025-05-07T20:32:54.8955247Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _
2025-05-07T20:32:54.8955261Z 
2025-05-07T20:32:54.8955526Z self = <triton.compiler.compiler.ASTSource object at 0x...>
2025-05-07T20:32:54.8956284Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True)
2025-05-07T20:32:54.8956783Z codegen_fns = {'convert_custom_types': <function ...>, 'min_dot_size': <function min_dot_size.<locals>.<lambda> at 0x7fe92a424720>}
2025-05-07T20:32:54.8957546Z module_map = {'triton.language.extra.libdevice': <module ...>}
2025-05-07T20:32:54.8957747Z context = <...>
2025-05-07T20:32:54.8957752Z 
2025-05-07T20:32:54.8957912Z     def make_ir(self, options, codegen_fns, module_map, context):
2025-05-07T20:32:54.8958168Z >       return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns,
2025-05-07T20:32:54.8958280Z                            module_map=module_map)
2025-05-07T20:32:54.8958438Z E   triton.compiler.errors.CompilationError: at 1:0:
2025-05-07T20:32:54.8958534Z E   def _fbgemm_silu_mul_quant(
2025-05-07T20:32:54.8958620Z E   ^
2025-05-07T20:32:54.8958964Z E   ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')")
2025-05-07T20:32:54.8958970Z 
2025-05-07T20:32:54.8959376Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:100: CompilationError
2025-05-07T20:32:54.8959383Z 
2025-05-07T20:32:54.8959485Z Trying example: test_silu_mul_quant(
2025-05-07T20:32:54.8959737Z     self=<...>,
2025-05-07T20:32:54.8959823Z     T=1,
2025-05-07T20:32:54.8959897Z     D=5120,
2025-05-07T20:32:54.8959987Z     scale_ub=None,
2025-05-07T20:32:54.8960072Z     contiguous=False,
2025-05-07T20:32:54.8960153Z     compiled=False,
2025-05-07T20:32:54.8960230Z )
2025-05-07T20:32:54.8960447Z self = <...>
2025-05-07T20:32:54.8960606Z T = 1, D = 5120, scale_ub = None, contiguous = False, compiled = False
2025-05-07T20:32:54.8960611Z 
2025-05-07T20:32:54.8960690Z     @given(
2025-05-07T20:32:54.8960806Z         T=st.sampled_from([1, 128, 2048, 4096, 16384]),
2025-05-07T20:32:54.8960902Z         D=st.sampled_from([5120, 7168]),
2025-05-07T20:32:54.8961060Z         scale_ub=st.sampled_from([None, 1200.00]),
2025-05-07T20:32:54.8961174Z         contiguous=st.sampled_from([True, False]),
2025-05-07T20:32:54.8961292Z         compiled=st.sampled_from([True, False]),
2025-05-07T20:32:54.8961367Z     )
2025-05-07T20:32:54.8961608Z     @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None)
2025-05-07T20:32:54.8961703Z     def test_silu_mul_quant(
2025-05-07T20:32:54.8961779Z         self,
2025-05-07T20:32:54.8961853Z         T: int,
2025-05-07T20:32:54.8961934Z         D: int,
2025-05-07T20:32:54.8962030Z         scale_ub: Optional[float],
2025-05-07T20:32:54.8962117Z         contiguous: bool,
2025-05-07T20:32:54.8962205Z         compiled: bool,
2025-05-07T20:32:54.8962282Z     ) -> None:
2025-05-07T20:32:54.8962376Z         torch.manual_seed(2025)
2025-05-07T20:32:54.8962452Z 
2025-05-07T20:32:54.8962616Z         x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16)
2025-05-07T20:32:54.8962685Z 
2025-05-07T20:32:54.8962781Z         x_sign = torch.sign(x)
2025-05-07T20:32:54.8962904Z         x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0)
2025-05-07T20:32:54.8962998Z         x = x_sign * x_clamp
2025-05-07T20:32:54.8963076Z         x0 = x[:, :D]
2025-05-07T20:32:54.8963157Z         x1 = x[:, D:]
2025-05-07T20:32:54.8963236Z 
2025-05-07T20:32:54.8963319Z         if contiguous:
2025-05-07T20:32:54.8963408Z             x0 = x0.contiguous()
2025-05-07T20:32:54.8963548Z             x1 = x1.contiguous()
2025-05-07T20:32:54.8963620Z 
2025-05-07T20:32:54.8963707Z         if scale_ub is not None:
2025-05-07T20:32:54.8963818Z             scale_ub_tensor = torch.tensor(
2025-05-07T20:32:54.8963949Z                 [scale_ub], device="cuda", dtype=torch.float32
2025-05-07T20:32:54.8964020Z             )
2025-05-07T20:32:54.8964103Z         else:
2025-05-07T20:32:54.8964193Z             scale_ub_tensor = None
2025-05-07T20:32:54.8964271Z 
2025-05-07T20:32:54.8964398Z         def fn() -> Tuple[torch.Tensor, torch.Tensor]:
2025-05-07T20:32:54.8964488Z             op = silu_mul_quant
2025-05-07T20:32:54.8964576Z             if compiled:
2025-05-07T20:32:54.8965120Z                 op = torch.compile(op)
2025-05-07T20:32:54.8965226Z             return op(x0, x1, scale_ub_tensor)
2025-05-07T20:32:54.8965306Z 
2025-05-07T20:32:54.8965395Z >       y_fp8, y_scale = fn()
2025-05-07T20:32:54.8965400Z 
2025-05-07T20:32:54.8965499Z moe/activation_test.py:117: 
2025-05-07T20:32:54.8965631Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _
2025-05-07T20:32:54.8965729Z moe/activation_test.py:115: in fn
2025-05-07T20:32:54.8965833Z     return op(x0, x1, scale_ub_tensor)
2025-05-07T20:32:54.8966320Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant
2025-05-07T20:32:54.8966417Z     _fbgemm_silu_mul_quant[grid](
2025-05-07T20:32:54.8966776Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/jit.py:330: in <lambda>
2025-05-07T20:32:54.8966998Z     return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs)
2025-05-07T20:32:54.8967379Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/jit.py:623: in run
2025-05-07T20:32:54.8967470Z     kernel = self.compile(
2025-05-07T20:32:54.8967845Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:273: in compile
2025-05-07T20:32:54.8968029Z     module = src.make_ir(options, codegen_fns, module_map, context)
2025-05-07T20:32:54.8968153Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _
2025-05-07T20:32:54.8968157Z 
2025-05-07T20:32:54.8968358Z self = <triton.compiler.compiler.ASTSource object at 0x...>
2025-05-07T20:32:54.8969124Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True)
2025-05-07T20:32:54.8969660Z codegen_fns = {'convert_custom_types': <function ...>, 'min_dot_size': <function min_dot_size.<locals>.<lambda> at 0x7fe92a425120>}
2025-05-07T20:32:54.8970391Z module_map = {'triton.language.extra.libdevice': <module ...>}
2025-05-07T20:32:54.8970577Z context = <...>
2025-05-07T20:32:54.8970582Z 
2025-05-07T20:32:54.8970752Z     def make_ir(self, options, codegen_fns, module_map, context):
2025-05-07T20:32:54.8971006Z >       return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns,
2025-05-07T20:32:54.8971111Z                            module_map=module_map)
2025-05-07T20:32:54.8971275Z E   triton.compiler.errors.CompilationError: at 1:0:
2025-05-07T20:32:54.8971372Z E   def _fbgemm_silu_mul_quant(
2025-05-07T20:32:54.8971450Z E   ^
2025-05-07T20:32:54.8971808Z E   ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')")
2025-05-07T20:32:54.8971812Z 
2025-05-07T20:32:54.8972217Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:100: CompilationError
2025-05-07T20:32:54.8972261Z 
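The failure above is architecture-specific rather than input-specific: Triton's fp8e4nv type is FP8 E4M3, which NVIDIA GPUs support natively only from compute capability 8.9 (Ada/Hopper) onward, and this job's linux.g5.4xlarge runner carries an A10G at compute capability 8.6. A minimal sketch of a guard a test could use to skip FP8 E4M3 cases on such devices -- supports_fp8e4nv is a hypothetical helper, not part of the FBGEMM sources:

    import torch

    def supports_fp8e4nv() -> bool:
        # Triton lowers fp8e4nv (FP8 E4M3) only on sm_89+ NVIDIA GPUs;
        # the A10G on this runner reports (8, 6), hence the ValueError.
        if not torch.cuda.is_available():
            return False
        return torch.cuda.get_device_capability() >= (8, 9)

A test module could then gate these cases with unittest.skipUnless(supports_fp8e4nv(), "requires sm_89+ for FP8 E4M3") or an equivalent pytest marker.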
Hypothesis then tries further parameter combinations; each one raises the identical CompilationError from the same _fbgemm_silu_mul_quant launch:

2025-05-07T20:32:54.8972369Z Trying example: test_silu_mul_quant(self=<...>, T=4096, D=7168, scale_ub=1200.0, contiguous=False, compiled=False)
2025-05-07T20:32:54.8984825Z Trying example: test_silu_mul_quant(self=<...>, T=16384, D=7168, scale_ub=None, contiguous=True, compiled=True)
2025-05-07T20:32:54.8997729Z Trying example: test_silu_mul_quant(self=<...>, T=4096, D=5120, scale_ub=None, contiguous=False, compiled=True)
2025-05-07T20:32:54.9013011Z Trying example: test_silu_mul_quant(self=<...>, T=4096, D=5120, scale_ub=1200.0, contiguous=False, compiled=False)
2025-05-07T20:32:54.9031190Z Trying example: test_silu_mul_quant(self=<...>, T=4096, D=5120, scale_ub=1200.0, contiguous=False, compiled=True)
The remaining examples fail the same way:

2025-05-07T20:32:54.9044736Z Trying example: test_silu_mul_quant(self=<...>, T=2048, D=7168, scale_ub=1200.0, contiguous=False, compiled=False)
2025-05-07T20:32:54.9057877Z Trying example: test_silu_mul_quant(self=<...>, T=1, D=7168, scale_ub=None, contiguous=True, compiled=False)
2025-05-07T20:32:54.9070862Z Trying example: test_silu_mul_quant(self=<...>, T=16384, D=7168, scale_ub=1200.0, contiguous=False, compiled=True)
2025-05-07T20:32:54.9084178Z Trying example: test_silu_mul_quant(self=<...>, T=1, D=7168, scale_ub=None, contiguous=False, compiled=False)
2025-05-07T20:32:54.9097060Z Trying example: test_silu_mul_quant(self=<...>, T=2048, D=7168, scale_ub=None, contiguous=False, compiled=True)
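The same ValueError can be reproduced outside FBGEMM with any Triton kernel that materializes the fp8e4nv dtype on a pre-sm_89 GPU; a minimal sketch (a hypothetical kernel, not taken from this log):

    import torch
    import triton
    import triton.language as tl

    @triton.jit
    def _to_fp8(x_ptr, y_ptr, n, BLOCK: tl.constexpr):
        offs = tl.program_id(0) * BLOCK + tl.arange(0, BLOCK)
        mask = offs < n
        x = tl.load(x_ptr + offs, mask=mask)
        # On sm < 89 this cast aborts compilation with
        # "type fp8e4nv not supported in this architecture".
        tl.store(y_ptr + offs, x.to(tl.float8e4nv), mask=mask)

    x = torch.randn(1024, device="cuda", dtype=torch.bfloat16)
    y = torch.empty(1024, device="cuda", dtype=torch.float8_e4m3fn)
    _to_fp8[(triton.cdiv(1024, 256),)](x, y, 1024, BLOCK=256)

The final example in this block takes the compiled=True path, where the identical error simply surfaces through torch.compile's eval_frame hook before reaching the same Triton compile step: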
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:54.9110690Z 2025-05-07T20:32:54.9111101Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:54.9111105Z 2025-05-07T20:32:54.9111219Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:54.9111437Z self=, 2025-05-07T20:32:54.9111517Z T=4096, 2025-05-07T20:32:54.9111606Z D=7168, 2025-05-07T20:32:54.9111690Z scale_ub=None, 2025-05-07T20:32:54.9111817Z contiguous=False, 2025-05-07T20:32:54.9111917Z compiled=True, 2025-05-07T20:32:54.9111995Z ) 2025-05-07T20:32:54.9112221Z self = 2025-05-07T20:32:54.9112395Z T = 4096, D = 7168, scale_ub = None, contiguous = False, compiled = True 2025-05-07T20:32:54.9112402Z 2025-05-07T20:32:54.9112479Z @given( 2025-05-07T20:32:54.9112608Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:54.9112710Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:54.9112824Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:54.9112950Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:54.9113065Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:54.9113147Z ) 2025-05-07T20:32:54.9113393Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:54.9113491Z def test_silu_mul_quant( 2025-05-07T20:32:54.9113579Z self, 2025-05-07T20:32:54.9113661Z T: int, 2025-05-07T20:32:54.9113739Z D: int, 2025-05-07T20:32:54.9113888Z scale_ub: Optional[float], 2025-05-07T20:32:54.9113979Z contiguous: bool, 2025-05-07T20:32:54.9114067Z compiled: bool, 2025-05-07T20:32:54.9114157Z ) -> None: 2025-05-07T20:32:54.9114253Z torch.manual_seed(2025) 2025-05-07T20:32:54.9114329Z 2025-05-07T20:32:54.9114503Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:54.9114578Z 2025-05-07T20:32:54.9114672Z x_sign = torch.sign(x) 2025-05-07T20:32:54.9114803Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:54.9114893Z x = x_sign * x_clamp 2025-05-07T20:32:54.9114978Z x0 = x[:, :D] 2025-05-07T20:32:54.9115060Z x1 = x[:, D:] 2025-05-07T20:32:54.9115177Z 2025-05-07T20:32:54.9115268Z if contiguous: 2025-05-07T20:32:54.9115364Z x0 = x0.contiguous() 2025-05-07T20:32:54.9115457Z x1 = x1.contiguous() 2025-05-07T20:32:54.9115541Z 2025-05-07T20:32:54.9115638Z if scale_ub is not None: 2025-05-07T20:32:54.9115745Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:54.9115886Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:54.9115966Z ) 2025-05-07T20:32:54.9116045Z else: 2025-05-07T20:32:54.9116148Z scale_ub_tensor = None 2025-05-07T20:32:54.9116221Z 2025-05-07T20:32:54.9116358Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:54.9116448Z op = silu_mul_quant 2025-05-07T20:32:54.9116535Z if compiled: 2025-05-07T20:32:54.9116642Z op = torch.compile(op) 2025-05-07T20:32:54.9116747Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:54.9116821Z 2025-05-07T20:32:54.9116927Z > y_fp8, y_scale = fn() 2025-05-07T20:32:54.9116932Z 2025-05-07T20:32:54.9117030Z moe/activation_test.py:117: 2025-05-07T20:32:54.9117162Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:54.9117275Z moe/activation_test.py:115: in fn 2025-05-07T20:32:54.9117376Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:54.9117750Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/torch/_dynamo/eval_frame.py:678: in _fn 2025-05-07T20:32:54.9117892Z return fn(*args, **kwargs) 
2025-05-07T20:32:54.9118380Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:32:54.9118488Z _fbgemm_silu_mul_quant[grid]( 2025-05-07T20:32:54.9118843Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:54.9119065Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:54.9119416Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:54.9119551Z kernel = self.compile( 2025-05-07T20:32:54.9119940Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:54.9120116Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:54.9120251Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:54.9120255Z 2025-05-07T20:32:54.9120483Z self = 2025-05-07T20:32:54.9121275Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:54.9121779Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7fe92a600d60>} 2025-05-07T20:32:54.9122572Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:54.9122768Z context = 2025-05-07T20:32:54.9122779Z 2025-05-07T20:32:54.9122945Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:54.9123204Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:54.9123319Z module_map=module_map) 2025-05-07T20:32:54.9123481Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:54.9123580Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:32:54.9123670Z E ^ 2025-05-07T20:32:54.9124060Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:54.9124067Z 2025-05-07T20:32:54.9124481Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:54.9124485Z 2025-05-07T20:32:54.9124589Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:54.9124813Z self=, 2025-05-07T20:32:54.9124902Z T=16384, 2025-05-07T20:32:54.9124980Z D=5120, 2025-05-07T20:32:54.9125064Z scale_ub=1200.0, 2025-05-07T20:32:54.9125158Z contiguous=False, 2025-05-07T20:32:54.9125246Z compiled=False, 2025-05-07T20:32:54.9125324Z ) 2025-05-07T20:32:54.9125550Z self = 2025-05-07T20:32:54.9125730Z T = 16384, D = 5120, scale_ub = 1200.0, contiguous = False, compiled = False 2025-05-07T20:32:54.9125737Z 2025-05-07T20:32:54.9125823Z @given( 2025-05-07T20:32:54.9125945Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:54.9126046Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:54.9126172Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:54.9126292Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:54.9126449Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:54.9126533Z ) 2025-05-07T20:32:54.9126777Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:54.9126880Z def test_silu_mul_quant( 2025-05-07T20:32:54.9126959Z self, 2025-05-07T20:32:54.9127042Z T: int, 2025-05-07T20:32:54.9127131Z D: int, 2025-05-07T20:32:54.9127232Z scale_ub: Optional[float], 2025-05-07T20:32:54.9127320Z contiguous: bool, 2025-05-07T20:32:54.9127410Z compiled: bool, 2025-05-07T20:32:54.9127489Z ) -> None: 2025-05-07T20:32:54.9127588Z torch.manual_seed(2025) 2025-05-07T20:32:54.9127667Z 2025-05-07T20:32:54.9127902Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:54.9127976Z 2025-05-07T20:32:54.9128078Z x_sign = torch.sign(x) 2025-05-07T20:32:54.9128201Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:54.9128289Z x = x_sign * x_clamp 2025-05-07T20:32:54.9128376Z x0 = x[:, :D] 2025-05-07T20:32:54.9128455Z x1 = x[:, D:] 2025-05-07T20:32:54.9128532Z 2025-05-07T20:32:54.9128616Z if contiguous: 2025-05-07T20:32:54.9128708Z x0 = x0.contiguous() 2025-05-07T20:32:54.9128802Z x1 = x1.contiguous() 2025-05-07T20:32:54.9128874Z 2025-05-07T20:32:54.9128965Z if scale_ub is not None: 2025-05-07T20:32:54.9129077Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:54.9129211Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:54.9129292Z ) 2025-05-07T20:32:54.9129372Z else: 2025-05-07T20:32:54.9129466Z scale_ub_tensor = None 2025-05-07T20:32:54.9129538Z 2025-05-07T20:32:54.9129673Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:54.9129805Z op = silu_mul_quant 2025-05-07T20:32:54.9129896Z if compiled: 2025-05-07T20:32:54.9129995Z op = torch.compile(op) 2025-05-07T20:32:54.9130105Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:54.9130184Z 2025-05-07T20:32:54.9130274Z > y_fp8, y_scale = fn() 2025-05-07T20:32:54.9130278Z 2025-05-07T20:32:54.9130375Z moe/activation_test.py:117: 2025-05-07T20:32:54.9130508Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:54.9130608Z moe/activation_test.py:115: in fn 2025-05-07T20:32:54.9130708Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:54.9131204Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 
2025-05-07T20:32:54.9131341Z _fbgemm_silu_mul_quant[grid]( 2025-05-07T20:32:54.9131708Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:54.9131928Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:54.9132265Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:54.9132366Z kernel = self.compile( 2025-05-07T20:32:54.9132745Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:54.9132923Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:54.9133050Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:54.9133054Z 2025-05-07T20:32:54.9133255Z self = 2025-05-07T20:32:54.9134130Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:54.9134625Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7fe92a601c60>} 2025-05-07T20:32:54.9135405Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:54.9135595Z context = 2025-05-07T20:32:54.9135599Z 2025-05-07T20:32:54.9135763Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:54.9136023Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:54.9136134Z module_map=module_map) 2025-05-07T20:32:54.9136338Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:54.9136437Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:32:54.9136513Z E ^ 2025-05-07T20:32:54.9136867Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:54.9136874Z 2025-05-07T20:32:54.9137278Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:54.9137282Z 2025-05-07T20:32:54.9137389Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:54.9137609Z self=, 2025-05-07T20:32:54.9137686Z T=16384, 2025-05-07T20:32:54.9137767Z D=5120, 2025-05-07T20:32:54.9137853Z scale_ub=1200.0, 2025-05-07T20:32:54.9137941Z contiguous=True, 2025-05-07T20:32:54.9138028Z compiled=True, 2025-05-07T20:32:54.9138101Z ) 2025-05-07T20:32:54.9138318Z self = 2025-05-07T20:32:54.9138535Z T = 16384, D = 5120, scale_ub = 1200.0, contiguous = True, compiled = True 2025-05-07T20:32:54.9138540Z 2025-05-07T20:32:54.9138616Z @given( 2025-05-07T20:32:54.9138737Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:54.9138839Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:54.9138954Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:54.9139075Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:54.9139187Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:54.9139261Z ) 2025-05-07T20:32:54.9139511Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:54.9139605Z def test_silu_mul_quant( 2025-05-07T20:32:54.9139728Z self, 2025-05-07T20:32:54.9143231Z T: int, 2025-05-07T20:32:54.9143317Z D: int, 2025-05-07T20:32:54.9143428Z scale_ub: Optional[float], 2025-05-07T20:32:54.9143521Z contiguous: bool, 2025-05-07T20:32:54.9143609Z compiled: bool, 2025-05-07T20:32:54.9143694Z ) -> None: 2025-05-07T20:32:54.9143790Z torch.manual_seed(2025) 2025-05-07T20:32:54.9143866Z 2025-05-07T20:32:54.9144043Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:54.9144117Z 2025-05-07T20:32:54.9144213Z x_sign = torch.sign(x) 2025-05-07T20:32:54.9144336Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:54.9144424Z x = x_sign * x_clamp 2025-05-07T20:32:54.9144509Z x0 = x[:, :D] 2025-05-07T20:32:54.9144590Z x1 = x[:, D:] 2025-05-07T20:32:54.9144663Z 2025-05-07T20:32:54.9144755Z if contiguous: 2025-05-07T20:32:54.9144846Z x0 = x0.contiguous() 2025-05-07T20:32:54.9144940Z x1 = x1.contiguous() 2025-05-07T20:32:54.9145017Z 2025-05-07T20:32:54.9145107Z if scale_ub is not None: 2025-05-07T20:32:54.9145216Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:54.9145358Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:54.9145434Z ) 2025-05-07T20:32:54.9145510Z else: 2025-05-07T20:32:54.9145674Z scale_ub_tensor = None 2025-05-07T20:32:54.9145744Z 2025-05-07T20:32:54.9145880Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:54.9145969Z op = silu_mul_quant 2025-05-07T20:32:54.9146053Z if compiled: 2025-05-07T20:32:54.9146156Z op = torch.compile(op) 2025-05-07T20:32:54.9146263Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:54.9146334Z 2025-05-07T20:32:54.9146429Z > y_fp8, y_scale = fn() 2025-05-07T20:32:54.9146434Z 2025-05-07T20:32:54.9146530Z moe/activation_test.py:117: 2025-05-07T20:32:54.9146665Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:54.9146814Z moe/activation_test.py:115: in fn 2025-05-07T20:32:54.9146919Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:54.9147292Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/torch/_dynamo/eval_frame.py:678: in _fn 2025-05-07T20:32:54.9147389Z return fn(*args, **kwargs) 
2025-05-07T20:32:54.9147875Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:32:54.9147975Z _fbgemm_silu_mul_quant[grid]( 2025-05-07T20:32:54.9148328Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:54.9148549Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:54.9148890Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:54.9148986Z kernel = self.compile( 2025-05-07T20:32:54.9149370Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:54.9149581Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:54.9149711Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:54.9149718Z 2025-05-07T20:32:54.9149929Z self = 2025-05-07T20:32:54.9150691Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:54.9151196Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7fe92a603380>} 2025-05-07T20:32:54.9151992Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:54.9152186Z context = 2025-05-07T20:32:54.9152193Z 2025-05-07T20:32:54.9152355Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:54.9152612Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:54.9152724Z module_map=module_map) 2025-05-07T20:32:54.9152887Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:54.9152986Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:32:54.9153072Z E ^ 2025-05-07T20:32:54.9153423Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:54.9153431Z 2025-05-07T20:32:54.9153844Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:54.9153849Z 2025-05-07T20:32:54.9153952Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:54.9154171Z self=, 2025-05-07T20:32:54.9154296Z T=16384, 2025-05-07T20:32:54.9154372Z D=5120, 2025-05-07T20:32:54.9154457Z scale_ub=None, 2025-05-07T20:32:54.9154551Z contiguous=False, 2025-05-07T20:32:54.9154635Z compiled=True, 2025-05-07T20:32:54.9154721Z ) 2025-05-07T20:32:54.9154936Z self = 2025-05-07T20:32:54.9155111Z T = 16384, D = 5120, scale_ub = None, contiguous = False, compiled = True 2025-05-07T20:32:54.9155116Z 2025-05-07T20:32:54.9155201Z @given( 2025-05-07T20:32:54.9155319Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:54.9155423Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:54.9155544Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:54.9155704Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:54.9155820Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:54.9155901Z ) 2025-05-07T20:32:54.9156143Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:54.9156244Z def test_silu_mul_quant( 2025-05-07T20:32:54.9156323Z self, 2025-05-07T20:32:54.9156401Z T: int, 2025-05-07T20:32:54.9156485Z D: int, 2025-05-07T20:32:54.9156587Z scale_ub: Optional[float], 2025-05-07T20:32:54.9156679Z contiguous: bool, 2025-05-07T20:32:54.9156771Z compiled: bool, 2025-05-07T20:32:54.9156849Z ) -> None: 2025-05-07T20:32:54.9156948Z torch.manual_seed(2025) 2025-05-07T20:32:54.9157020Z 2025-05-07T20:32:54.9157187Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:54.9157269Z 2025-05-07T20:32:54.9157360Z x_sign = torch.sign(x) 2025-05-07T20:32:54.9157486Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:54.9157620Z x = x_sign * x_clamp 2025-05-07T20:32:54.9157703Z x0 = x[:, :D] 2025-05-07T20:32:54.9157782Z x1 = x[:, D:] 2025-05-07T20:32:54.9157862Z 2025-05-07T20:32:54.9157948Z if contiguous: 2025-05-07T20:32:54.9158042Z x0 = x0.contiguous() 2025-05-07T20:32:54.9158132Z x1 = x1.contiguous() 2025-05-07T20:32:54.9158205Z 2025-05-07T20:32:54.9158302Z if scale_ub is not None: 2025-05-07T20:32:54.9158409Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:54.9158544Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:54.9158629Z ) 2025-05-07T20:32:54.9158706Z else: 2025-05-07T20:32:54.9158799Z scale_ub_tensor = None 2025-05-07T20:32:54.9158918Z 2025-05-07T20:32:54.9159047Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:54.9159136Z op = silu_mul_quant 2025-05-07T20:32:54.9159229Z if compiled: 2025-05-07T20:32:54.9159331Z op = torch.compile(op) 2025-05-07T20:32:54.9159438Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:54.9159511Z 2025-05-07T20:32:54.9159604Z > y_fp8, y_scale = fn() 2025-05-07T20:32:54.9159609Z 2025-05-07T20:32:54.9159710Z moe/activation_test.py:117: 2025-05-07T20:32:54.9159839Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:54.9159937Z moe/activation_test.py:115: in fn 2025-05-07T20:32:54.9160042Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:54.9160444Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/torch/_dynamo/eval_frame.py:678: in _fn 2025-05-07T20:32:54.9160551Z return fn(*args, **kwargs) 
2025-05-07T20:32:54.9161039Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:32:54.9161141Z _fbgemm_silu_mul_quant[grid]( 2025-05-07T20:32:54.9161501Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:54.9161720Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:54.9162125Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:54.9162223Z kernel = self.compile( 2025-05-07T20:32:54.9162597Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:54.9162773Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:54.9162900Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:54.9162911Z 2025-05-07T20:32:54.9163111Z self = 2025-05-07T20:32:54.9163916Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:54.9164419Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7fe92a3a85e0>} 2025-05-07T20:32:54.9165158Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:54.9165347Z context = 2025-05-07T20:32:54.9165352Z 2025-05-07T20:32:54.9165515Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:54.9165778Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:54.9165888Z module_map=module_map) 2025-05-07T20:32:54.9166095Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:54.9166194Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:32:54.9166273Z E ^ 2025-05-07T20:32:54.9166626Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:54.9166631Z 2025-05-07T20:32:54.9167034Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:54.9167039Z 2025-05-07T20:32:54.9167146Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:54.9167365Z self=, 2025-05-07T20:32:54.9167441Z T=2048, 2025-05-07T20:32:54.9167563Z D=5120, 2025-05-07T20:32:54.9167646Z scale_ub=None, 2025-05-07T20:32:54.9167732Z contiguous=False, 2025-05-07T20:32:54.9167818Z compiled=True, 2025-05-07T20:32:54.9167892Z ) 2025-05-07T20:32:54.9168110Z self = 2025-05-07T20:32:54.9168285Z T = 2048, D = 5120, scale_ub = None, contiguous = False, compiled = True 2025-05-07T20:32:54.9168292Z 2025-05-07T20:32:54.9168367Z @given( 2025-05-07T20:32:54.9168491Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:54.9168586Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:54.9168698Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:54.9168816Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:54.9168929Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:54.9169002Z ) 2025-05-07T20:32:54.9169247Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:54.9169342Z def test_silu_mul_quant( 2025-05-07T20:32:54.9169417Z self, 2025-05-07T20:32:54.9169500Z T: int, 2025-05-07T20:32:54.9169578Z D: int, 2025-05-07T20:32:54.9169677Z scale_ub: Optional[float], 2025-05-07T20:32:54.9169772Z contiguous: bool, 2025-05-07T20:32:54.9169857Z compiled: bool, 2025-05-07T20:32:54.9169941Z ) -> None: 2025-05-07T20:32:54.9170076Z torch.manual_seed(2025) 2025-05-07T20:32:54.9170148Z 2025-05-07T20:32:54.9170324Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:54.9170410Z 2025-05-07T20:32:54.9170512Z x_sign = torch.sign(x) 2025-05-07T20:32:54.9170663Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:54.9170752Z x = x_sign * x_clamp 2025-05-07T20:32:54.9170832Z x0 = x[:, :D] 2025-05-07T20:32:54.9170914Z x1 = x[:, D:] 2025-05-07T20:32:54.9170986Z 2025-05-07T20:32:54.9171069Z if contiguous: 2025-05-07T20:32:54.9171169Z x0 = x0.contiguous() 2025-05-07T20:32:54.9171258Z x1 = x1.contiguous() 2025-05-07T20:32:54.9171334Z 2025-05-07T20:32:54.9171466Z if scale_ub is not None: 2025-05-07T20:32:54.9171573Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:54.9171712Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:54.9171794Z ) 2025-05-07T20:32:54.9171872Z else: 2025-05-07T20:32:54.9171967Z scale_ub_tensor = None 2025-05-07T20:32:54.9172041Z 2025-05-07T20:32:54.9172170Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:54.9172263Z op = silu_mul_quant 2025-05-07T20:32:54.9172348Z if compiled: 2025-05-07T20:32:54.9172447Z op = torch.compile(op) 2025-05-07T20:32:54.9172555Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:54.9172628Z 2025-05-07T20:32:54.9172721Z > y_fp8, y_scale = fn() 2025-05-07T20:32:54.9172728Z 2025-05-07T20:32:54.9172824Z moe/activation_test.py:117: 2025-05-07T20:32:54.9172954Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:54.9173056Z moe/activation_test.py:115: in fn 2025-05-07T20:32:54.9173197Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:54.9173562Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/torch/_dynamo/eval_frame.py:678: in _fn 2025-05-07T20:32:54.9173769Z return fn(*args, **kwargs) 
2025-05-07T20:32:54.9174257Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:32:54.9174358Z _fbgemm_silu_mul_quant[grid]( 2025-05-07T20:32:54.9174710Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:54.9174929Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:54.9175312Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:54.9175405Z kernel = self.compile( 2025-05-07T20:32:54.9175790Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:54.9175969Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:54.9176097Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:54.9176101Z 2025-05-07T20:32:54.9176310Z self = 2025-05-07T20:32:54.9177073Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:54.9177569Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7fe92a3a9440>} 2025-05-07T20:32:54.9178311Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:54.9178502Z context = 2025-05-07T20:32:54.9178547Z 2025-05-07T20:32:54.9178717Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:54.9178972Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:54.9179082Z module_map=module_map) 2025-05-07T20:32:54.9179244Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:54.9179343Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:32:54.9179425Z E ^ 2025-05-07T20:32:54.9179773Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:54.9179780Z 2025-05-07T20:32:54.9180230Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:54.9180234Z 2025-05-07T20:32:54.9180342Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:54.9180604Z self=, 2025-05-07T20:32:54.9180693Z T=2048, 2025-05-07T20:32:54.9180768Z D=5120, 2025-05-07T20:32:54.9180852Z scale_ub=1200.0, 2025-05-07T20:32:54.9180942Z contiguous=False, 2025-05-07T20:32:54.9181025Z compiled=True, 2025-05-07T20:32:54.9181098Z ) 2025-05-07T20:32:54.9181318Z self = 2025-05-07T20:32:54.9181488Z T = 2048, D = 5120, scale_ub = 1200.0, contiguous = False, compiled = True 2025-05-07T20:32:54.9181493Z 2025-05-07T20:32:54.9181569Z @given( 2025-05-07T20:32:54.9181694Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:54.9181791Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:54.9181911Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:54.9182068Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:54.9182183Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:54.9182265Z ) 2025-05-07T20:32:54.9182512Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:54.9182606Z def test_silu_mul_quant( 2025-05-07T20:32:54.9182685Z self, 2025-05-07T20:32:54.9182762Z T: int, 2025-05-07T20:32:54.9182837Z D: int, 2025-05-07T20:32:54.9182939Z scale_ub: Optional[float], 2025-05-07T20:32:54.9183028Z contiguous: bool, 2025-05-07T20:32:54.9183113Z compiled: bool, 2025-05-07T20:32:54.9183194Z ) -> None: 2025-05-07T20:32:54.9183288Z torch.manual_seed(2025) 2025-05-07T20:32:54.9183404Z 2025-05-07T20:32:54.9183572Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:54.9183642Z 2025-05-07T20:32:54.9183738Z x_sign = torch.sign(x) 2025-05-07T20:32:54.9183863Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:54.9183951Z x = x_sign * x_clamp 2025-05-07T20:32:54.9184034Z x0 = x[:, :D] 2025-05-07T20:32:54.9184120Z x1 = x[:, D:] 2025-05-07T20:32:54.9184193Z 2025-05-07T20:32:54.9184280Z if contiguous: 2025-05-07T20:32:54.9184370Z x0 = x0.contiguous() 2025-05-07T20:32:54.9184457Z x1 = x1.contiguous() 2025-05-07T20:32:54.9184534Z 2025-05-07T20:32:54.9184624Z if scale_ub is not None: 2025-05-07T20:32:54.9184728Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:54.9184861Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:54.9184935Z ) 2025-05-07T20:32:54.9185013Z else: 2025-05-07T20:32:54.9185108Z scale_ub_tensor = None 2025-05-07T20:32:54.9185181Z 2025-05-07T20:32:54.9185315Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:54.9185405Z op = silu_mul_quant 2025-05-07T20:32:54.9185492Z if compiled: 2025-05-07T20:32:54.9185594Z op = torch.compile(op) 2025-05-07T20:32:54.9185698Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:54.9185815Z 2025-05-07T20:32:54.9185908Z > y_fp8, y_scale = fn() 2025-05-07T20:32:54.9185913Z 2025-05-07T20:32:54.9186007Z moe/activation_test.py:117: 2025-05-07T20:32:54.9186138Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:54.9186239Z moe/activation_test.py:115: in fn 2025-05-07T20:32:54.9186335Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:54.9186700Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/torch/_dynamo/eval_frame.py:678: in _fn 2025-05-07T20:32:54.9186796Z return fn(*args, **kwargs) 
2025-05-07T20:32:54.9187320Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:32:54.9187423Z _fbgemm_silu_mul_quant[grid]( 2025-05-07T20:32:54.9187775Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:54.9188002Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:54.9188336Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:54.9188429Z kernel = self.compile( 2025-05-07T20:32:54.9188815Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:54.9188986Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:54.9189113Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:54.9189124Z 2025-05-07T20:32:54.9189331Z self = 2025-05-07T20:32:54.9190132Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:54.9190633Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7fe92a3aa660>} 2025-05-07T20:32:54.9191363Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:54.9191556Z context = 2025-05-07T20:32:54.9191621Z 2025-05-07T20:32:54.9191784Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:54.9192046Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:54.9192158Z module_map=module_map) 2025-05-07T20:32:54.9192316Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:54.9192421Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:32:54.9192497Z E ^ 2025-05-07T20:32:54.9192845Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:54.9192850Z 2025-05-07T20:32:54.9193259Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:54.9193263Z 2025-05-07T20:32:54.9193365Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:54.9193583Z self=, 2025-05-07T20:32:54.9193664Z T=4096, 2025-05-07T20:32:54.9193739Z D=5120, 2025-05-07T20:32:54.9193825Z scale_ub=1200.0, 2025-05-07T20:32:54.9193910Z contiguous=True, 2025-05-07T20:32:54.9193992Z compiled=True, 2025-05-07T20:32:54.9194072Z ) 2025-05-07T20:32:54.9194289Z self = 2025-05-07T20:32:54.9194502Z T = 4096, D = 5120, scale_ub = 1200.0, contiguous = True, compiled = True 2025-05-07T20:32:54.9194507Z 2025-05-07T20:32:54.9194586Z @given( 2025-05-07T20:32:54.9194703Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:54.9194801Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:54.9194917Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:54.9195033Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:54.9195148Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:54.9195221Z ) 2025-05-07T20:32:54.9195463Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:54.9195563Z def test_silu_mul_quant( 2025-05-07T20:32:54.9195640Z self, 2025-05-07T20:32:54.9195758Z T: int, 2025-05-07T20:32:54.9195844Z D: int, 2025-05-07T20:32:54.9195942Z scale_ub: Optional[float], 2025-05-07T20:32:54.9196031Z contiguous: bool, 2025-05-07T20:32:54.9196125Z compiled: bool, 2025-05-07T20:32:54.9196203Z ) -> None: 2025-05-07T20:32:54.9196297Z torch.manual_seed(2025) 2025-05-07T20:32:54.9196372Z 2025-05-07T20:32:54.9196538Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:54.9196613Z 2025-05-07T20:32:54.9196705Z x_sign = torch.sign(x) 2025-05-07T20:32:54.9196827Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:54.9196918Z x = x_sign * x_clamp 2025-05-07T20:32:54.9196998Z x0 = x[:, :D] 2025-05-07T20:32:54.9197077Z x1 = x[:, D:] 2025-05-07T20:32:54.9197153Z 2025-05-07T20:32:54.9197236Z if contiguous: 2025-05-07T20:32:54.9197325Z x0 = x0.contiguous() 2025-05-07T20:32:54.9197418Z x1 = x1.contiguous() 2025-05-07T20:32:54.9197490Z 2025-05-07T20:32:54.9197620Z if scale_ub is not None: 2025-05-07T20:32:54.9197730Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:54.9197864Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:54.9197947Z ) 2025-05-07T20:32:54.9198022Z else: 2025-05-07T20:32:54.9198119Z scale_ub_tensor = None 2025-05-07T20:32:54.9198507Z 2025-05-07T20:32:54.9198691Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:54.9198785Z op = silu_mul_quant 2025-05-07T20:32:54.9198873Z if compiled: 2025-05-07T20:32:54.9198972Z op = torch.compile(op) 2025-05-07T20:32:54.9199076Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:54.9199153Z 2025-05-07T20:32:54.9199343Z > y_fp8, y_scale = fn() 2025-05-07T20:32:54.9199348Z 2025-05-07T20:32:54.9199442Z moe/activation_test.py:117: 2025-05-07T20:32:54.9199584Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:54.9199684Z moe/activation_test.py:115: in fn 2025-05-07T20:32:54.9199787Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:54.9200154Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/torch/_dynamo/eval_frame.py:678: in _fn 2025-05-07T20:32:54.9200246Z return fn(*args, **kwargs) 
2025-05-07T20:32:54.9200736Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:32:54.9200833Z _fbgemm_silu_mul_quant[grid]( 2025-05-07T20:32:54.9201187Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:54.9201406Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:54.9201755Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:54.9201847Z kernel = self.compile( 2025-05-07T20:32:54.9202232Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:54.9202476Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:54.9202603Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:54.9202607Z 2025-05-07T20:32:54.9202813Z self = 2025-05-07T20:32:54.9203575Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:54.9204136Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7fe92a3ab9c0>} 2025-05-07T20:32:54.9204872Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:54.9205065Z context = 2025-05-07T20:32:54.9205069Z 2025-05-07T20:32:54.9205239Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:54.9205498Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:54.9205613Z module_map=module_map) 2025-05-07T20:32:54.9205770Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:54.9205867Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:32:54.9205950Z E ^ 2025-05-07T20:32:54.9206298Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:54.9206302Z 2025-05-07T20:32:54.9206770Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:54.9206774Z 2025-05-07T20:32:54.9206881Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:54.9207102Z self=, 2025-05-07T20:32:54.9207182Z T=128, 2025-05-07T20:32:54.9207258Z D=5120, 2025-05-07T20:32:54.9207340Z scale_ub=1200.0, 2025-05-07T20:32:54.9207427Z contiguous=False, 2025-05-07T20:32:54.9207511Z compiled=True, 2025-05-07T20:32:54.9207586Z ) 2025-05-07T20:32:54.9207804Z self = 2025-05-07T20:32:54.9207970Z T = 128, D = 5120, scale_ub = 1200.0, contiguous = False, compiled = True 2025-05-07T20:32:54.9208016Z 2025-05-07T20:32:54.9208095Z @given( 2025-05-07T20:32:54.9208212Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:54.9208311Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:54.9208432Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:54.9208547Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:54.9208661Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:54.9208739Z ) 2025-05-07T20:32:54.9208981Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:54.9209074Z def test_silu_mul_quant( 2025-05-07T20:32:54.9209154Z self, 2025-05-07T20:32:54.9209230Z T: int, 2025-05-07T20:32:54.9209309Z D: int, 2025-05-07T20:32:54.9209406Z scale_ub: Optional[float], 2025-05-07T20:32:54.9209494Z contiguous: bool, 2025-05-07T20:32:54.9209582Z compiled: bool, 2025-05-07T20:32:54.9209660Z ) -> None: 2025-05-07T20:32:54.9209757Z torch.manual_seed(2025) 2025-05-07T20:32:54.9209834Z 2025-05-07T20:32:54.9210000Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:54.9210072Z 2025-05-07T20:32:54.9210172Z x_sign = torch.sign(x) 2025-05-07T20:32:54.9210294Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:54.9210429Z x = x_sign * x_clamp 2025-05-07T20:32:54.9210530Z x0 = x[:, :D] 2025-05-07T20:32:54.9210619Z x1 = x[:, D:] 2025-05-07T20:32:54.9210703Z 2025-05-07T20:32:54.9210801Z if contiguous: 2025-05-07T20:32:54.9210892Z x0 = x0.contiguous() 2025-05-07T20:32:54.9210986Z x1 = x1.contiguous() 2025-05-07T20:32:54.9211056Z 2025-05-07T20:32:54.9211146Z if scale_ub is not None: 2025-05-07T20:32:54.9211253Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:54.9211387Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:54.9211465Z ) 2025-05-07T20:32:54.9211544Z else: 2025-05-07T20:32:54.9211638Z scale_ub_tensor = None 2025-05-07T20:32:54.9211709Z 2025-05-07T20:32:54.9211883Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:54.9211977Z op = silu_mul_quant 2025-05-07T20:32:54.9212059Z if compiled: 2025-05-07T20:32:54.9212162Z op = torch.compile(op) 2025-05-07T20:32:54.9212270Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:54.9212345Z 2025-05-07T20:32:54.9212438Z > y_fp8, y_scale = fn() 2025-05-07T20:32:54.9212443Z 2025-05-07T20:32:54.9212539Z moe/activation_test.py:117: 2025-05-07T20:32:54.9212672Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:54.9212770Z moe/activation_test.py:115: in fn 2025-05-07T20:32:54.9212870Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:54.9213234Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/torch/_dynamo/eval_frame.py:678: in _fn 2025-05-07T20:32:54.9213327Z return fn(*args, **kwargs) 
2025-05-07T20:32:54.9213933Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:32:54.9214031Z _fbgemm_silu_mul_quant[grid]( 2025-05-07T20:32:54.9214386Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:54.9214616Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:54.9214953Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:54.9215044Z kernel = self.compile( 2025-05-07T20:32:54.9215424Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:54.9215596Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:54.9215765Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:54.9215770Z 2025-05-07T20:32:54.9215976Z self = 2025-05-07T20:32:54.9216743Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:54.9217244Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7fe92a548fe0>} 2025-05-07T20:32:54.9217975Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:54.9218168Z context = 2025-05-07T20:32:54.9218176Z 2025-05-07T20:32:54.9218336Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:54.9218602Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:54.9218710Z module_map=module_map) 2025-05-07T20:32:54.9218912Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:54.9219014Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:32:54.9219093Z E ^ 2025-05-07T20:32:54.9219442Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:54.9219447Z 2025-05-07T20:32:54.9219858Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:54.9219863Z 2025-05-07T20:32:54.9219965Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:54.9220189Z self=, 2025-05-07T20:32:54.9220265Z T=16384, 2025-05-07T20:32:54.9220342Z D=7168, 2025-05-07T20:32:54.9220466Z scale_ub=1200.0, 2025-05-07T20:32:54.9220553Z contiguous=True, 2025-05-07T20:32:54.9220636Z compiled=True, 2025-05-07T20:32:54.9220712Z ) 2025-05-07T20:32:54.9220927Z self = 2025-05-07T20:32:54.9221100Z T = 16384, D = 7168, scale_ub = 1200.0, contiguous = True, compiled = True 2025-05-07T20:32:54.9221108Z 2025-05-07T20:32:54.9221184Z @given( 2025-05-07T20:32:54.9221301Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:54.9221404Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:54.9221520Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:54.9221635Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:54.9221753Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:54.9221828Z ) 2025-05-07T20:32:54.9222069Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:54.9222167Z def test_silu_mul_quant( 2025-05-07T20:32:54.9222241Z self, 2025-05-07T20:32:54.9222384Z T: int, 2025-05-07T20:32:54.9222465Z D: int, 2025-05-07T20:32:54.9222562Z scale_ub: Optional[float], 2025-05-07T20:32:54.9222656Z contiguous: bool, 2025-05-07T20:32:54.9222741Z compiled: bool, 2025-05-07T20:32:54.9222817Z ) -> None: 2025-05-07T20:32:54.9222913Z torch.manual_seed(2025) 2025-05-07T20:32:54.9222984Z 2025-05-07T20:32:54.9223148Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:54.9223224Z 2025-05-07T20:32:54.9223312Z x_sign = torch.sign(x) 2025-05-07T20:32:54.9223434Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:54.9223522Z x = x_sign * x_clamp 2025-05-07T20:32:54.9223602Z x0 = x[:, :D] 2025-05-07T20:32:54.9223723Z x1 = x[:, D:] 2025-05-07T20:32:54.9223798Z 2025-05-07T20:32:54.9223879Z if contiguous: 2025-05-07T20:32:54.9223973Z x0 = x0.contiguous() 2025-05-07T20:32:54.9224066Z x1 = x1.contiguous() 2025-05-07T20:32:54.9224137Z 2025-05-07T20:32:54.9224230Z if scale_ub is not None: 2025-05-07T20:32:54.9224334Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:54.9224469Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:54.9224547Z ) 2025-05-07T20:32:54.9224621Z else: 2025-05-07T20:32:54.9224714Z scale_ub_tensor = None 2025-05-07T20:32:54.9224790Z 2025-05-07T20:32:54.9224919Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:54.9225005Z op = silu_mul_quant 2025-05-07T20:32:54.9225094Z if compiled: 2025-05-07T20:32:54.9225192Z op = torch.compile(op) 2025-05-07T20:32:54.9225297Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:54.9225376Z 2025-05-07T20:32:54.9225468Z > y_fp8, y_scale = fn() 2025-05-07T20:32:54.9225472Z 2025-05-07T20:32:54.9225576Z moe/activation_test.py:117: 2025-05-07T20:32:54.9225708Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:54.9225807Z moe/activation_test.py:115: in fn 2025-05-07T20:32:54.9225958Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:54.9226321Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/torch/_dynamo/eval_frame.py:678: in _fn 2025-05-07T20:32:54.9226414Z return fn(*args, **kwargs) 
2025-05-07T20:32:54.9226906Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:32:54.9227002Z _fbgemm_silu_mul_quant[grid]( 2025-05-07T20:32:54.9227358Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:54.9227580Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:54.9227956Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:54.9228053Z kernel = self.compile( 2025-05-07T20:32:54.9228430Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:54.9228603Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:54.9228737Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:54.9228741Z 2025-05-07T20:32:54.9228940Z self = 2025-05-07T20:32:54.9229701Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:54.9230240Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7fe92a549e40>} 2025-05-07T20:32:54.9230977Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:54.9231166Z context = 2025-05-07T20:32:54.9231171Z 2025-05-07T20:32:54.9231333Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:54.9231594Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:54.9231702Z module_map=module_map) 2025-05-07T20:32:54.9231864Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:54.9232000Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:32:54.9232077Z E ^ 2025-05-07T20:32:54.9232432Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:54.9232439Z 2025-05-07T20:32:54.9232843Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:54.9232850Z 2025-05-07T20:32:54.9232957Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:54.9233173Z self=, 2025-05-07T20:32:54.9233249Z T=16384, 2025-05-07T20:32:54.9233326Z D=5120, 2025-05-07T20:32:54.9233407Z scale_ub=1200.0, 2025-05-07T20:32:54.9233490Z contiguous=True, 2025-05-07T20:32:54.9233576Z compiled=False, 2025-05-07T20:32:54.9233648Z ) 2025-05-07T20:32:54.9233863Z self = 2025-05-07T20:32:54.9234047Z T = 16384, D = 5120, scale_ub = 1200.0, contiguous = True, compiled = False 2025-05-07T20:32:54.9234052Z 2025-05-07T20:32:54.9234126Z @given( 2025-05-07T20:32:54.9234245Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:54.9234349Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:54.9234463Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:54.9234622Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:54.9234736Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:54.9234807Z ) 2025-05-07T20:32:54.9235055Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:54.9235147Z def test_silu_mul_quant( 2025-05-07T20:32:54.9235221Z self, 2025-05-07T20:32:54.9235299Z T: int, 2025-05-07T20:32:54.9235374Z D: int, 2025-05-07T20:32:54.9235472Z scale_ub: Optional[float], 2025-05-07T20:32:54.9235563Z contiguous: bool, 2025-05-07T20:32:54.9235651Z compiled: bool, 2025-05-07T20:32:54.9235732Z ) -> None: 2025-05-07T20:32:54.9235825Z torch.manual_seed(2025) 2025-05-07T20:32:54.9235897Z 2025-05-07T20:32:54.9236110Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:54.9236183Z 2025-05-07T20:32:54.9236274Z x_sign = torch.sign(x) 2025-05-07T20:32:54.9236401Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:54.9236491Z x = x_sign * x_clamp 2025-05-07T20:32:54.9236569Z x0 = x[:, :D] 2025-05-07T20:32:54.9236654Z x1 = x[:, D:] 2025-05-07T20:32:54.9236725Z 2025-05-07T20:32:54.9236808Z if contiguous: 2025-05-07T20:32:54.9236901Z x0 = x0.contiguous() 2025-05-07T20:32:54.9236988Z x1 = x1.contiguous() 2025-05-07T20:32:54.9237059Z 2025-05-07T20:32:54.9237150Z if scale_ub is not None: 2025-05-07T20:32:54.9237254Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:54.9237391Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:54.9237469Z ) 2025-05-07T20:32:54.9237544Z else: 2025-05-07T20:32:54.9237643Z scale_ub_tensor = None 2025-05-07T20:32:54.9237713Z 2025-05-07T20:32:54.9237883Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:54.9237980Z op = silu_mul_quant 2025-05-07T20:32:54.9238066Z if compiled: 2025-05-07T20:32:54.9238164Z op = torch.compile(op) 2025-05-07T20:32:54.9238273Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:54.9238348Z 2025-05-07T20:32:54.9238439Z > y_fp8, y_scale = fn() 2025-05-07T20:32:54.9238446Z 2025-05-07T20:32:54.9238543Z moe/activation_test.py:117: 2025-05-07T20:32:54.9238671Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:54.9238774Z moe/activation_test.py:115: in fn 2025-05-07T20:32:54.9238873Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:54.9239402Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 
    _fbgemm_silu_mul_quant[grid](
/home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/jit.py:330: in <lambda>
    return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs)
/home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/jit.py:623: in run
    kernel = self.compile(
/home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:273: in compile
    module = src.make_ir(options, codegen_fns, module_map, context)
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _

    def make_ir(self, options, codegen_fns, module_map, context):
>       return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns,
                           module_map=module_map)
E       triton.compiler.errors.CompilationError: at 1:0:
E       def _fbgemm_silu_mul_quant(
E       ^
E       ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')")

/home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:100: CompilationError

Trying example: test_silu_mul_quant(
    T=1,
    D=7168,
    scale_ub=1200.0,
    contiguous=False,
    compiled=False,
)
T = 1, D = 7168, scale_ub = 1200.0, contiguous = False, compiled = False

    @given(
        T=st.sampled_from([1, 128, 2048, 4096, 16384]),
        D=st.sampled_from([5120, 7168]),
        scale_ub=st.sampled_from([None, 1200.00]),
        contiguous=st.sampled_from([True, False]),
        compiled=st.sampled_from([True, False]),
    )
    @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None)
    def test_silu_mul_quant(
        self,
        T: int,
        D: int,
        scale_ub: Optional[float],
        contiguous: bool,
        compiled: bool,
    ) -> None:
        torch.manual_seed(2025)

        x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16)

        x_sign = torch.sign(x)
        x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0)
        x = x_sign * x_clamp
        x0 = x[:, :D]
        x1 = x[:, D:]

        if contiguous:
            x0 = x0.contiguous()
            x1 = x1.contiguous()

        if scale_ub is not None:
            scale_ub_tensor = torch.tensor(
                [scale_ub], device="cuda", dtype=torch.float32
            )
        else:
            scale_ub_tensor = None

        def fn() -> Tuple[torch.Tensor, torch.Tensor]:
            op = silu_mul_quant
            if compiled:
                op = torch.compile(op)
            return op(x0, x1, scale_ub_tensor)

>       y_fp8, y_scale = fn()

moe/activation_test.py:117:
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _
moe/activation_test.py:115: in fn
    return op(x0, x1, scale_ub_tensor)
/home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant
    _fbgemm_silu_mul_quant[grid](
/home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/jit.py:330: in <lambda>
    return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs)
/home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/jit.py:623: in run
    kernel = self.compile(
/home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:273: in compile
    module = src.make_ir(options, codegen_fns, module_map, context)
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _

    def make_ir(self, options, codegen_fns, module_map, context):
>       return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns,
                           module_map=module_map)
E       triton.compiler.errors.CompilationError: at 1:0:
E       def _fbgemm_silu_mul_quant(
E       ^
E       ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')")

/home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:100: CompilationError
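[editor's note] The recurring failure is Triton refusing to compile the _fbgemm_silu_mul_quant kernel because it uses the fp8e4nv (float8 e4m3) element type, which this GPU's architecture does not implement; only fp8e4b15 and fp8e5 are available here. A minimal sketch of a skip guard for such machines — the (8, 9) compute-capability threshold is an assumption about where hardware fp8e4nv support begins, not something this log states:

    import unittest

    import torch


    def supports_fp8e4nv() -> bool:
        # Assumption: Triton's fp8e4nv corresponds to hardware fp8 support,
        # first present at compute capability 8.9 (Ada) / 9.0 (Hopper).
        # The log only shows that this particular GPU rejects the type.
        if not torch.cuda.is_available():
            return False
        return torch.cuda.get_device_capability() >= (8, 9)


    # Hypothetical usage on the test method:
    # @unittest.skipUnless(supports_fp8e4nv(), "GPU lacks fp8e4nv support")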
Trying example: test_silu_mul_quant(T=4096, D=7168, scale_ub=1200.0, contiguous=False, compiled=True)
Trying example: test_silu_mul_quant(T=128, D=7168, scale_ub=1200.0, contiguous=False, compiled=True)
Trying example: test_silu_mul_quant(T=2048, D=7168, scale_ub=None, contiguous=True, compiled=True)

Each of these fails at moe/activation_test.py:117 with the identical traceback and error as above; the only difference is that with compiled=True the call additionally passes through torch/_dynamo/eval_frame.py:678 (return fn(*args, **kwargs)) before reaching silu_mul_quant:

E       triton.compiler.errors.CompilationError: at 1:0:
E       def _fbgemm_silu_mul_quant(
E       ^
E       ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')")
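[editor's note] Hypothesis's verbose output lists every drawn example. To replay one of these parameter sets deterministically while debugging, it can be pinned with Hypothesis's @example decorator — a sketch on a standalone version of the test (self dropped, body elided), not the original FBGEMM test file:

    from typing import Optional

    from hypothesis import example, given, settings, strategies as st


    @given(
        T=st.sampled_from([1, 128, 2048, 4096, 16384]),
        D=st.sampled_from([5120, 7168]),
        scale_ub=st.sampled_from([None, 1200.00]),
        contiguous=st.sampled_from([True, False]),
        compiled=st.sampled_from([True, False]),
    )
    @example(T=1, D=7168, scale_ub=1200.0, contiguous=False, compiled=False)
    @settings(deadline=None)
    def test_silu_mul_quant(
        T: int,
        D: int,
        scale_ub: Optional[float],
        contiguous: bool,
        compiled: bool,
    ) -> None:
        ...  # body as in the original test

The pinned example always runs in addition to the randomly drawn ones, so a fix can be verified against the exact failing case.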
Trying example: test_silu_mul_quant(T=16384, D=5120, scale_ub=None, contiguous=False, compiled=False)

        torch.manual_seed(2025)

        x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16)

        x_sign = torch.sign(x)
>       x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0)
E       torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 320.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 140.44 MiB is free. Including non-PyTorch memory, this process has 21.92 GiB memory in use. Of the allocated memory 21.60 GiB is allocated by PyTorch, and 45.02 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables)

moe/activation_test.py:95: OutOfMemoryError

The next three examples hit the same OutOfMemoryError at different allocation sites:

Trying example: test_silu_mul_quant(T=4096, D=7168, scale_ub=1200.0, contiguous=True, compiled=True)
  -> torch.OutOfMemoryError at moe/activation_test.py:95 (x_clamp): tried to allocate 112.00 MiB with 28.44 MiB free
Trying example: test_silu_mul_quant(T=16384, D=7168, scale_ub=None, contiguous=False, compiled=False)
  -> torch.OutOfMemoryError at moe/activation_test.py:92 (torch.randn): tried to allocate 448.00 MiB with 140.44 MiB free
Trying example: test_silu_mul_quant(T=2048, D=7168, scale_ub=1200.0, contiguous=True, compiled=True)
  -> torch.OutOfMemoryError at moe/activation_test.py:95 (x_clamp): tried to allocate 56.00 MiB with 28.44 MiB free
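[editor's note] The requested allocation sizes line up exactly with the test's shapes: x is [T, 2 * D] in bfloat16 (2 bytes per element), and each elementwise temporary (torch.randn, torch.sign, the abs/clamp result) needs one more tensor of that size. A quick check of the arithmetic (helper name is illustrative):

    def tensor_mib(T: int, D: int, bytes_per_elem: int = 2) -> float:
        # Size of one [T, 2 * D] bfloat16 tensor in MiB (bfloat16 = 2 bytes).
        return T * (2 * D) * bytes_per_elem / (1024 ** 2)

    # The sizes PyTorch tried (and failed) to allocate above:
    assert tensor_mib(16384, 5120) == 320.0  # x_clamp for T=16384, D=5120
    assert tensor_mib(4096, 7168) == 112.0   # x_clamp for T=4096, D=7168
    assert tensor_mib(16384, 7168) == 448.0  # torch.randn for T=16384, D=7168
    assert tensor_mib(2048, 7168) == 56.0    # x_clamp for T=2048, D=7168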
Trying example: test_silu_mul_quant(T=2048, D=7168, scale_ub=None, contiguous=True, compiled=False)

        x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16)

>       x_sign = torch.sign(x)
E       torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 56.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 28.44 MiB is free. Including non-PyTorch memory, this process has 22.03 GiB memory in use. Of the allocated memory 21.67 GiB is allocated by PyTorch, and 85.02 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables)

moe/activation_test.py:94: OutOfMemoryError
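[editor's note] Each OutOfMemoryError message suggests PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True against fragmentation. Since the failures here accumulate as Hypothesis runs example after example in one process, explicitly releasing cached blocks between examples is another plausible mitigation. A sketch — the environment variable and empty_cache are standard PyTorch usage, while the helper and its placement are hypothetical:

    import os

    # The allocator reads PYTORCH_CUDA_ALLOC_CONF at first CUDA use,
    # so it must be set before any tensor lands on the GPU.
    os.environ["PYTORCH_CUDA_ALLOC_CONF"] = "expandable_segments:True"

    import gc

    import torch


    def release_cached_memory() -> None:
        # Drop unreferenced tensors, then return cached blocks to the
        # driver; something a test could call between Hypothesis examples.
        gc.collect()
        torch.cuda.empty_cache()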
Trying example: test_silu_mul_quant(T=1, D=7168, scale_ub=1200.0, contiguous=True, compiled=False)
Trying example: test_silu_mul_quant(T=128, D=5120, scale_ub=None, contiguous=True, compiled=False)
Trying example: test_silu_mul_quant(T=128, D=7168, scale_ub=None, contiguous=True, compiled=False)

All three reach the kernel launch at moe/activation_test.py:117 and fail with the same CompilationError as above:

E       ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')")
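[editor's note] For context on what keeps failing to compile: judging from the name and the test's return signature (y_fp8, y_scale), silu_mul_quant fuses a SiLU-gated multiply with fp8 quantization. A hedged eager-mode sketch of just the activation part, inferred from the name rather than from FBGEMM source:

    import torch
    import torch.nn.functional as F


    def silu_mul_ref(x0: torch.Tensor, x1: torch.Tensor) -> torch.Tensor:
        # Assumed semantics of the fused kernel's activation: SiLU(x0) * x1.
        # The fp8 quantization step (which produces y_fp8 and y_scale, and
        # which is what needs fp8e4nv) is deliberately left out.
        return F.silu(x0) * x1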
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:54.9364081Z 2025-05-07T20:32:54.9364485Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:54.9364492Z 2025-05-07T20:32:54.9364591Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:54.9364811Z self=, 2025-05-07T20:32:54.9364887Z T=2048, 2025-05-07T20:32:54.9364962Z D=7168, 2025-05-07T20:32:54.9365047Z scale_ub=1200.0, 2025-05-07T20:32:54.9365127Z contiguous=True, 2025-05-07T20:32:54.9365256Z compiled=False, 2025-05-07T20:32:54.9365326Z ) 2025-05-07T20:32:54.9365535Z self = 2025-05-07T20:32:54.9365704Z T = 2048, D = 7168, scale_ub = 1200.0, contiguous = True, compiled = False 2025-05-07T20:32:54.9365709Z 2025-05-07T20:32:54.9365782Z @given( 2025-05-07T20:32:54.9365895Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:54.9365993Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:54.9366105Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:54.9366218Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:54.9366333Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:54.9366406Z ) 2025-05-07T20:32:54.9366687Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:54.9366778Z def test_silu_mul_quant( 2025-05-07T20:32:54.9366852Z self, 2025-05-07T20:32:54.9366933Z T: int, 2025-05-07T20:32:54.9367006Z D: int, 2025-05-07T20:32:54.9367100Z scale_ub: Optional[float], 2025-05-07T20:32:54.9367189Z contiguous: bool, 2025-05-07T20:32:54.9367272Z compiled: bool, 2025-05-07T20:32:54.9367348Z ) -> None: 2025-05-07T20:32:54.9367442Z torch.manual_seed(2025) 2025-05-07T20:32:54.9367511Z 2025-05-07T20:32:54.9367673Z > x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:54.9369449Z E torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 56.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 26.44 MiB is free. Including non-PyTorch memory, this process has 22.04 GiB memory in use. Of the allocated memory 21.69 GiB is allocated by PyTorch, and 59.18 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. 
See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables) 2025-05-07T20:32:54.9369461Z 2025-05-07T20:32:54.9369577Z moe/activation_test.py:92: OutOfMemoryError 2025-05-07T20:32:54.9369581Z 2025-05-07T20:32:54.9369680Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:54.9369895Z self=, 2025-05-07T20:32:54.9369977Z T=1, 2025-05-07T20:32:54.9370050Z D=5120, 2025-05-07T20:32:54.9370129Z scale_ub=1200.0, 2025-05-07T20:32:54.9370214Z contiguous=True, 2025-05-07T20:32:54.9370294Z compiled=False, 2025-05-07T20:32:54.9370412Z ) 2025-05-07T20:32:54.9370664Z self = 2025-05-07T20:32:54.9370830Z T = 1, D = 5120, scale_ub = 1200.0, contiguous = True, compiled = False 2025-05-07T20:32:54.9370835Z 2025-05-07T20:32:54.9370914Z @given( 2025-05-07T20:32:54.9371028Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:54.9371125Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:54.9371240Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:54.9371351Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:54.9371459Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:54.9371534Z ) 2025-05-07T20:32:54.9371770Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:54.9371859Z def test_silu_mul_quant( 2025-05-07T20:32:54.9371935Z self, 2025-05-07T20:32:54.9372009Z T: int, 2025-05-07T20:32:54.9372082Z D: int, 2025-05-07T20:32:54.9372182Z scale_ub: Optional[float], 2025-05-07T20:32:54.9372270Z contiguous: bool, 2025-05-07T20:32:54.9372358Z compiled: bool, 2025-05-07T20:32:54.9372435Z ) -> None: 2025-05-07T20:32:54.9372529Z torch.manual_seed(2025) 2025-05-07T20:32:54.9372601Z 2025-05-07T20:32:54.9372762Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:54.9372877Z 2025-05-07T20:32:54.9372970Z x_sign = torch.sign(x) 2025-05-07T20:32:54.9373091Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:54.9373177Z x = x_sign * x_clamp 2025-05-07T20:32:54.9373257Z x0 = x[:, :D] 2025-05-07T20:32:54.9373334Z x1 = x[:, D:] 2025-05-07T20:32:54.9373403Z 2025-05-07T20:32:54.9373485Z if contiguous: 2025-05-07T20:32:54.9373572Z x0 = x0.contiguous() 2025-05-07T20:32:54.9373722Z x1 = x1.contiguous() 2025-05-07T20:32:54.9373793Z 2025-05-07T20:32:54.9373880Z if scale_ub is not None: 2025-05-07T20:32:54.9373988Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:54.9374191Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:54.9374265Z ) 2025-05-07T20:32:54.9374347Z else: 2025-05-07T20:32:54.9374437Z scale_ub_tensor = None 2025-05-07T20:32:54.9374507Z 2025-05-07T20:32:54.9374637Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:54.9374728Z op = silu_mul_quant 2025-05-07T20:32:54.9374809Z if compiled: 2025-05-07T20:32:54.9374910Z op = torch.compile(op) 2025-05-07T20:32:54.9375012Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:54.9375086Z 2025-05-07T20:32:54.9375174Z > y_fp8, y_scale = fn() 2025-05-07T20:32:54.9375178Z 2025-05-07T20:32:54.9375271Z moe/activation_test.py:117: 2025-05-07T20:32:54.9375400Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:54.9375501Z moe/activation_test.py:115: in fn 2025-05-07T20:32:54.9375598Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:54.9376133Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:32:54.9376228Z 
_fbgemm_silu_mul_quant[grid]( 2025-05-07T20:32:54.9376583Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:54.9376805Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:54.9377138Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:54.9377231Z kernel = self.compile( 2025-05-07T20:32:54.9377607Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:54.9377779Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:54.9377950Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:54.9377954Z 2025-05-07T20:32:54.9378155Z self = 2025-05-07T20:32:54.9378920Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:54.9379417Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7fe5b9b61580>} 2025-05-07T20:32:54.9380150Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:54.9380339Z context = 2025-05-07T20:32:54.9380346Z 2025-05-07T20:32:54.9380511Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:54.9380770Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:54.9380873Z module_map=module_map) 2025-05-07T20:32:54.9381073Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:54.9381170Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:32:54.9381243Z E ^ 2025-05-07T20:32:54.9381591Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:54.9381596Z 2025-05-07T20:32:54.9381999Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:54.9382003Z 2025-05-07T20:32:54.9382102Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:54.9382324Z self=, 2025-05-07T20:32:54.9382399Z T=2048, 2025-05-07T20:32:54.9382475Z D=5120, 2025-05-07T20:32:54.9382592Z scale_ub=None, 2025-05-07T20:32:54.9382678Z contiguous=True, 2025-05-07T20:32:54.9382763Z compiled=False, 2025-05-07T20:32:54.9382833Z ) 2025-05-07T20:32:54.9383044Z self = 2025-05-07T20:32:54.9383215Z T = 2048, D = 5120, scale_ub = None, contiguous = True, compiled = False 2025-05-07T20:32:54.9383219Z 2025-05-07T20:32:54.9383292Z @given( 2025-05-07T20:32:54.9383407Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:54.9386402Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:54.9386534Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:54.9386656Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:54.9386765Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:54.9386844Z ) 2025-05-07T20:32:54.9387089Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:54.9387187Z def test_silu_mul_quant( 2025-05-07T20:32:54.9387266Z self, 2025-05-07T20:32:54.9387405Z T: int, 2025-05-07T20:32:54.9387481Z D: int, 2025-05-07T20:32:54.9387581Z scale_ub: Optional[float], 2025-05-07T20:32:54.9387674Z contiguous: bool, 2025-05-07T20:32:54.9387756Z compiled: bool, 2025-05-07T20:32:54.9387834Z ) -> None: 2025-05-07T20:32:54.9387926Z torch.manual_seed(2025) 2025-05-07T20:32:54.9387996Z 2025-05-07T20:32:54.9388162Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:54.9388233Z 2025-05-07T20:32:54.9388321Z > x_sign = torch.sign(x) 2025-05-07T20:32:54.9390076Z E torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 40.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 26.44 MiB is free. Including non-PyTorch memory, this process has 22.04 GiB memory in use. Of the allocated memory 21.73 GiB is allocated by PyTorch, and 19.12 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. 
2025-05-07T20:32:54.9382003Z 
2025-05-07T20:32:54.9382102Z Trying example: test_silu_mul_quant(
2025-05-07T20:32:54.9382324Z     self=,
2025-05-07T20:32:54.9382399Z     T=2048,
2025-05-07T20:32:54.9382475Z     D=5120,
2025-05-07T20:32:54.9382592Z     scale_ub=None,
2025-05-07T20:32:54.9382678Z     contiguous=True,
2025-05-07T20:32:54.9382763Z     compiled=False,
2025-05-07T20:32:54.9382833Z )
2025-05-07T20:32:54.9383044Z self = 
2025-05-07T20:32:54.9383215Z T = 2048, D = 5120, scale_ub = None, contiguous = True, compiled = False
2025-05-07T20:32:54.9383219Z 
2025-05-07T20:32:54.9383292Z @given(
2025-05-07T20:32:54.9383407Z     T=st.sampled_from([1, 128, 2048, 4096, 16384]),
2025-05-07T20:32:54.9386402Z     D=st.sampled_from([5120, 7168]),
2025-05-07T20:32:54.9386534Z     scale_ub=st.sampled_from([None, 1200.00]),
2025-05-07T20:32:54.9386656Z     contiguous=st.sampled_from([True, False]),
2025-05-07T20:32:54.9386765Z     compiled=st.sampled_from([True, False]),
2025-05-07T20:32:54.9386844Z )
2025-05-07T20:32:54.9387089Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None)
2025-05-07T20:32:54.9387187Z def test_silu_mul_quant(
2025-05-07T20:32:54.9387266Z     self,
2025-05-07T20:32:54.9387405Z     T: int,
2025-05-07T20:32:54.9387481Z     D: int,
2025-05-07T20:32:54.9387581Z     scale_ub: Optional[float],
2025-05-07T20:32:54.9387674Z     contiguous: bool,
2025-05-07T20:32:54.9387756Z     compiled: bool,
2025-05-07T20:32:54.9387834Z ) -> None:
2025-05-07T20:32:54.9387926Z     torch.manual_seed(2025)
2025-05-07T20:32:54.9387996Z 
2025-05-07T20:32:54.9388162Z     x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16)
2025-05-07T20:32:54.9388233Z 
2025-05-07T20:32:54.9388321Z >   x_sign = torch.sign(x)
2025-05-07T20:32:54.9390076Z E   torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 40.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 26.44 MiB is free. Including non-PyTorch memory, this process has 22.04 GiB memory in use. Of the allocated memory 21.73 GiB is allocated by PyTorch, and 19.12 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables)
2025-05-07T20:32:54.9390127Z 
2025-05-07T20:32:54.9390243Z moe/activation_test.py:94: OutOfMemoryError
2025-05-07T20:32:54.9390251Z 
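[NOTE] From here on, nearly every Hypothesis example dies at its first CUDA allocation: the process is already holding ~22 GiB of a 22.07 GiB device, so even a 40 MiB request fails. The error text itself points at the allocator knob; a sketch of applying it, together with an explicit cleanup a harness could run between examples (both are standard PyTorch/CPython APIs, but whether they would fully recover this job's memory is an assumption):

import os

# Must be set before the process first initializes CUDA.
os.environ.setdefault("PYTORCH_CUDA_ALLOC_CONF", "expandable_segments:True")

import gc

import torch

def release_cuda_memory() -> None:
    # Drop dead Python references, then return cached blocks to the driver.
    gc.collect()
    torch.cuda.empty_cache()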
2025-05-07T20:32:54.9390349Z Trying example: test_silu_mul_quant(
2025-05-07T20:32:54.9390564Z     self=,
2025-05-07T20:32:54.9390650Z     T=16384,
2025-05-07T20:32:54.9390723Z     D=5120,
2025-05-07T20:32:54.9390801Z     scale_ub=None,
2025-05-07T20:32:54.9390885Z     contiguous=True,
2025-05-07T20:32:54.9390965Z     compiled=False,
2025-05-07T20:32:54.9391036Z )
2025-05-07T20:32:54.9391248Z self = 
2025-05-07T20:32:54.9391424Z T = 16384, D = 5120, scale_ub = None, contiguous = True, compiled = False
2025-05-07T20:32:54.9391429Z 
2025-05-07T20:32:54.9391502Z @given(
2025-05-07T20:32:54.9391621Z     T=st.sampled_from([1, 128, 2048, 4096, 16384]),
2025-05-07T20:32:54.9391718Z     D=st.sampled_from([5120, 7168]),
2025-05-07T20:32:54.9391876Z     scale_ub=st.sampled_from([None, 1200.00]),
2025-05-07T20:32:54.9391989Z     contiguous=st.sampled_from([True, False]),
2025-05-07T20:32:54.9392097Z     compiled=st.sampled_from([True, False]),
2025-05-07T20:32:54.9392171Z )
2025-05-07T20:32:54.9392409Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None)
2025-05-07T20:32:54.9392500Z def test_silu_mul_quant(
2025-05-07T20:32:54.9392578Z     self,
2025-05-07T20:32:54.9392652Z     T: int,
2025-05-07T20:32:54.9392726Z     D: int,
2025-05-07T20:32:54.9392822Z     scale_ub: Optional[float],
2025-05-07T20:32:54.9392912Z     contiguous: bool,
2025-05-07T20:32:54.9392996Z     compiled: bool,
2025-05-07T20:32:54.9393071Z ) -> None:
2025-05-07T20:32:54.9393202Z     torch.manual_seed(2025)
2025-05-07T20:32:54.9393279Z 
2025-05-07T20:32:54.9393438Z >   x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16)
2025-05-07T20:32:54.9395164Z E   torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 320.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 26.44 MiB is free. Including non-PyTorch memory, this process has 22.04 GiB memory in use. Of the allocated memory 21.73 GiB is allocated by PyTorch, and 19.12 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables)
2025-05-07T20:32:54.9395177Z 
2025-05-07T20:32:54.9395295Z moe/activation_test.py:92: OutOfMemoryError
2025-05-07T20:32:54.9395299Z 
2025-05-07T20:32:54.9395396Z Trying example: test_silu_mul_quant(
2025-05-07T20:32:54.9395656Z     self=,
2025-05-07T20:32:54.9395731Z     T=4096,
2025-05-07T20:32:54.9395804Z     D=5120,
2025-05-07T20:32:54.9395891Z     scale_ub=None,
2025-05-07T20:32:54.9395975Z     contiguous=True,
2025-05-07T20:32:54.9396055Z     compiled=False,
2025-05-07T20:32:54.9396131Z )
2025-05-07T20:32:54.9396341Z self = 
2025-05-07T20:32:54.9396511Z T = 4096, D = 5120, scale_ub = None, contiguous = True, compiled = False
2025-05-07T20:32:54.9396515Z 
2025-05-07T20:32:54.9396590Z @given(
2025-05-07T20:32:54.9396704Z     T=st.sampled_from([1, 128, 2048, 4096, 16384]),
2025-05-07T20:32:54.9396803Z     D=st.sampled_from([5120, 7168]),
2025-05-07T20:32:54.9396914Z     scale_ub=st.sampled_from([None, 1200.00]),
2025-05-07T20:32:54.9397072Z     contiguous=st.sampled_from([True, False]),
2025-05-07T20:32:54.9397187Z     compiled=st.sampled_from([True, False]),
2025-05-07T20:32:54.9397261Z )
2025-05-07T20:32:54.9397507Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None)
2025-05-07T20:32:54.9397597Z def test_silu_mul_quant(
2025-05-07T20:32:54.9397671Z     self,
2025-05-07T20:32:54.9397754Z     T: int,
2025-05-07T20:32:54.9397828Z     D: int,
2025-05-07T20:32:54.9397922Z     scale_ub: Optional[float],
2025-05-07T20:32:54.9398010Z     contiguous: bool,
2025-05-07T20:32:54.9398090Z     compiled: bool,
2025-05-07T20:32:54.9398572Z ) -> None:
2025-05-07T20:32:54.9398682Z     torch.manual_seed(2025)
2025-05-07T20:32:54.9398753Z 
2025-05-07T20:32:54.9398935Z >   x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16)
2025-05-07T20:32:54.9401201Z E   torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 80.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 26.44 MiB is free. Including non-PyTorch memory, this process has 22.04 GiB memory in use. Of the allocated memory 21.73 GiB is allocated by PyTorch, and 19.12 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation.
See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables) 2025-05-07T20:32:54.9401305Z 2025-05-07T20:32:54.9401419Z moe/activation_test.py:92: OutOfMemoryError 2025-05-07T20:32:54.9401428Z 2025-05-07T20:32:54.9401529Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:54.9401743Z self=, 2025-05-07T20:32:54.9401824Z T=2048, 2025-05-07T20:32:54.9401899Z D=5120, 2025-05-07T20:32:54.9401978Z scale_ub=None, 2025-05-07T20:32:54.9402064Z contiguous=False, 2025-05-07T20:32:54.9402145Z compiled=False, 2025-05-07T20:32:54.9402223Z ) 2025-05-07T20:32:54.9402433Z self = 2025-05-07T20:32:54.9402659Z T = 2048, D = 5120, scale_ub = None, contiguous = False, compiled = False 2025-05-07T20:32:54.9402667Z 2025-05-07T20:32:54.9402744Z @given( 2025-05-07T20:32:54.9402859Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:54.9402957Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:54.9403071Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:54.9403182Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:54.9403293Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:54.9403364Z ) 2025-05-07T20:32:54.9403600Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:54.9403693Z def test_silu_mul_quant( 2025-05-07T20:32:54.9403766Z self, 2025-05-07T20:32:54.9403846Z T: int, 2025-05-07T20:32:54.9403924Z D: int, 2025-05-07T20:32:54.9404019Z scale_ub: Optional[float], 2025-05-07T20:32:54.9404103Z contiguous: bool, 2025-05-07T20:32:54.9404190Z compiled: bool, 2025-05-07T20:32:54.9404267Z ) -> None: 2025-05-07T20:32:54.9404422Z torch.manual_seed(2025) 2025-05-07T20:32:54.9404496Z 2025-05-07T20:32:54.9404656Z > x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:54.9406377Z E torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 40.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 26.44 MiB is free. Including non-PyTorch memory, this process has 22.04 GiB memory in use. Of the allocated memory 21.73 GiB is allocated by PyTorch, and 19.12 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. 
See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables) 2025-05-07T20:32:54.9406438Z 2025-05-07T20:32:54.9406551Z moe/activation_test.py:92: OutOfMemoryError 2025-05-07T20:32:54.9406555Z 2025-05-07T20:32:54.9406658Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:54.9406875Z self=, 2025-05-07T20:32:54.9406949Z T=4096, 2025-05-07T20:32:54.9407026Z D=7168, 2025-05-07T20:32:54.9407107Z scale_ub=None, 2025-05-07T20:32:54.9407189Z contiguous=True, 2025-05-07T20:32:54.9407271Z compiled=True, 2025-05-07T20:32:54.9407340Z ) 2025-05-07T20:32:54.9407549Z self = 2025-05-07T20:32:54.9407713Z T = 4096, D = 7168, scale_ub = None, contiguous = True, compiled = True 2025-05-07T20:32:54.9407717Z 2025-05-07T20:32:54.9407790Z @given( 2025-05-07T20:32:54.9407907Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:54.9408000Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:54.9408115Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:54.9408229Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:54.9408340Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:54.9408415Z ) 2025-05-07T20:32:54.9408654Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:54.9408814Z def test_silu_mul_quant( 2025-05-07T20:32:54.9408889Z self, 2025-05-07T20:32:54.9408967Z T: int, 2025-05-07T20:32:54.9409040Z D: int, 2025-05-07T20:32:54.9409136Z scale_ub: Optional[float], 2025-05-07T20:32:54.9409222Z contiguous: bool, 2025-05-07T20:32:54.9409304Z compiled: bool, 2025-05-07T20:32:54.9409381Z ) -> None: 2025-05-07T20:32:54.9409474Z torch.manual_seed(2025) 2025-05-07T20:32:54.9409543Z 2025-05-07T20:32:54.9409710Z > x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:54.9411472Z E torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 112.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 26.44 MiB is free. Including non-PyTorch memory, this process has 22.04 GiB memory in use. Of the allocated memory 21.73 GiB is allocated by PyTorch, and 19.12 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. 
See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables) 2025-05-07T20:32:54.9411485Z 2025-05-07T20:32:54.9411601Z moe/activation_test.py:92: OutOfMemoryError 2025-05-07T20:32:54.9411605Z 2025-05-07T20:32:54.9411705Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:54.9411920Z self=, 2025-05-07T20:32:54.9411998Z T=2048, 2025-05-07T20:32:54.9412071Z D=5120, 2025-05-07T20:32:54.9412154Z scale_ub=1200.0, 2025-05-07T20:32:54.9412241Z contiguous=False, 2025-05-07T20:32:54.9412322Z compiled=False, 2025-05-07T20:32:54.9412395Z ) 2025-05-07T20:32:54.9412608Z self = 2025-05-07T20:32:54.9412814Z T = 2048, D = 5120, scale_ub = 1200.0, contiguous = False, compiled = False 2025-05-07T20:32:54.9412818Z 2025-05-07T20:32:54.9412894Z @given( 2025-05-07T20:32:54.9413010Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:54.9413104Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:54.9413216Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:54.9413328Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:54.9413439Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:54.9413509Z ) 2025-05-07T20:32:54.9413852Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:54.9413945Z def test_silu_mul_quant( 2025-05-07T20:32:54.9414085Z self, 2025-05-07T20:32:54.9414160Z T: int, 2025-05-07T20:32:54.9414236Z D: int, 2025-05-07T20:32:54.9414333Z scale_ub: Optional[float], 2025-05-07T20:32:54.9414418Z contiguous: bool, 2025-05-07T20:32:54.9414504Z compiled: bool, 2025-05-07T20:32:54.9414580Z ) -> None: 2025-05-07T20:32:54.9414671Z torch.manual_seed(2025) 2025-05-07T20:32:54.9414745Z 2025-05-07T20:32:54.9414907Z > x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:54.9416627Z E torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 40.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 26.44 MiB is free. Including non-PyTorch memory, this process has 22.04 GiB memory in use. Of the allocated memory 21.73 GiB is allocated by PyTorch, and 19.12 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. 
See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables) 2025-05-07T20:32:54.9416635Z 2025-05-07T20:32:54.9416753Z moe/activation_test.py:92: OutOfMemoryError 2025-05-07T20:32:54.9416757Z 2025-05-07T20:32:54.9416861Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:54.9417076Z self=, 2025-05-07T20:32:54.9417193Z T=4096, 2025-05-07T20:32:54.9417268Z D=7168, 2025-05-07T20:32:54.9417347Z scale_ub=1200.0, 2025-05-07T20:32:54.9417428Z contiguous=True, 2025-05-07T20:32:54.9417510Z compiled=False, 2025-05-07T20:32:54.9417579Z ) 2025-05-07T20:32:54.9417789Z self = 2025-05-07T20:32:54.9417956Z T = 4096, D = 7168, scale_ub = 1200.0, contiguous = True, compiled = False 2025-05-07T20:32:54.9417960Z 2025-05-07T20:32:54.9418032Z @given( 2025-05-07T20:32:54.9418147Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:54.9418245Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:54.9418355Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:54.9418514Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:54.9418627Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:54.9418698Z ) 2025-05-07T20:32:54.9418941Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:54.9419034Z def test_silu_mul_quant( 2025-05-07T20:32:54.9419107Z self, 2025-05-07T20:32:54.9419185Z T: int, 2025-05-07T20:32:54.9419259Z D: int, 2025-05-07T20:32:54.9419357Z scale_ub: Optional[float], 2025-05-07T20:32:54.9419442Z contiguous: bool, 2025-05-07T20:32:54.9419524Z compiled: bool, 2025-05-07T20:32:54.9419602Z ) -> None: 2025-05-07T20:32:54.9419691Z torch.manual_seed(2025) 2025-05-07T20:32:54.9419764Z 2025-05-07T20:32:54.9419925Z > x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:54.9421690Z E torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 112.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 26.44 MiB is free. Including non-PyTorch memory, this process has 22.04 GiB memory in use. Of the allocated memory 21.73 GiB is allocated by PyTorch, and 19.12 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. 
See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables) 2025-05-07T20:32:54.9421698Z 2025-05-07T20:32:54.9421815Z moe/activation_test.py:92: OutOfMemoryError 2025-05-07T20:32:54.9421820Z 2025-05-07T20:32:54.9421918Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:54.9422132Z self=, 2025-05-07T20:32:54.9422210Z T=16384, 2025-05-07T20:32:54.9422323Z D=7168, 2025-05-07T20:32:54.9422405Z scale_ub=None, 2025-05-07T20:32:54.9422487Z contiguous=False, 2025-05-07T20:32:54.9422568Z compiled=True, 2025-05-07T20:32:54.9422646Z ) 2025-05-07T20:32:54.9422858Z self = 2025-05-07T20:32:54.9423026Z T = 16384, D = 7168, scale_ub = None, contiguous = False, compiled = True 2025-05-07T20:32:54.9423033Z 2025-05-07T20:32:54.9423108Z @given( 2025-05-07T20:32:54.9423220Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:54.9423314Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:54.9423426Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:54.9423538Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:54.9423654Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:54.9423725Z ) 2025-05-07T20:32:54.9423960Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:54.9424055Z def test_silu_mul_quant( 2025-05-07T20:32:54.9424128Z self, 2025-05-07T20:32:54.9424203Z T: int, 2025-05-07T20:32:54.9424282Z D: int, 2025-05-07T20:32:54.9424379Z scale_ub: Optional[float], 2025-05-07T20:32:54.9424464Z contiguous: bool, 2025-05-07T20:32:54.9424550Z compiled: bool, 2025-05-07T20:32:54.9424625Z ) -> None: 2025-05-07T20:32:54.9424760Z torch.manual_seed(2025) 2025-05-07T20:32:54.9424832Z 2025-05-07T20:32:54.9424991Z > x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:54.9426755Z E torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 448.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 26.44 MiB is free. Including non-PyTorch memory, this process has 22.04 GiB memory in use. Of the allocated memory 21.73 GiB is allocated by PyTorch, and 19.12 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. 
See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables) 2025-05-07T20:32:54.9426763Z 2025-05-07T20:32:54.9426879Z moe/activation_test.py:92: OutOfMemoryError 2025-05-07T20:32:54.9426883Z 2025-05-07T20:32:54.9426984Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:54.9427200Z self=, 2025-05-07T20:32:54.9427274Z T=4096, 2025-05-07T20:32:54.9427353Z D=7168, 2025-05-07T20:32:54.9427433Z scale_ub=None, 2025-05-07T20:32:54.9427514Z contiguous=True, 2025-05-07T20:32:54.9427602Z compiled=False, 2025-05-07T20:32:54.9427671Z ) 2025-05-07T20:32:54.9427879Z self = 2025-05-07T20:32:54.9428047Z T = 4096, D = 7168, scale_ub = None, contiguous = True, compiled = False 2025-05-07T20:32:54.9428052Z 2025-05-07T20:32:54.9428124Z @given( 2025-05-07T20:32:54.9428245Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:54.9428339Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:54.9428451Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:54.9428606Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:54.9428716Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:54.9428789Z ) 2025-05-07T20:32:54.9429029Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:54.9429118Z def test_silu_mul_quant( 2025-05-07T20:32:54.9429191Z self, 2025-05-07T20:32:54.9429266Z T: int, 2025-05-07T20:32:54.9429339Z D: int, 2025-05-07T20:32:54.9429436Z scale_ub: Optional[float], 2025-05-07T20:32:54.9429525Z contiguous: bool, 2025-05-07T20:32:54.9429606Z compiled: bool, 2025-05-07T20:32:54.9429682Z ) -> None: 2025-05-07T20:32:54.9429771Z torch.manual_seed(2025) 2025-05-07T20:32:54.9429884Z 2025-05-07T20:32:54.9430046Z > x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:54.9431765Z E torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 112.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 26.44 MiB is free. Including non-PyTorch memory, this process has 22.04 GiB memory in use. Of the allocated memory 21.73 GiB is allocated by PyTorch, and 19.12 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. 
See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables) 2025-05-07T20:32:54.9431773Z 2025-05-07T20:32:54.9431887Z moe/activation_test.py:92: OutOfMemoryError 2025-05-07T20:32:54.9431891Z 2025-05-07T20:32:54.9431989Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:54.9432202Z self=, 2025-05-07T20:32:54.9432281Z T=16384, 2025-05-07T20:32:54.9432355Z D=7168, 2025-05-07T20:32:54.9432435Z scale_ub=None, 2025-05-07T20:32:54.9432517Z contiguous=True, 2025-05-07T20:32:54.9432598Z compiled=False, 2025-05-07T20:32:54.9432674Z ) 2025-05-07T20:32:54.9432886Z self = 2025-05-07T20:32:54.9433095Z T = 16384, D = 7168, scale_ub = None, contiguous = True, compiled = False 2025-05-07T20:32:54.9433099Z 2025-05-07T20:32:54.9433174Z @given( 2025-05-07T20:32:54.9433286Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:54.9433380Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:54.9433497Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:54.9433608Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:54.9433718Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:54.9433787Z ) 2025-05-07T20:32:54.9434022Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:54.9434118Z def test_silu_mul_quant( 2025-05-07T20:32:54.9434231Z self, 2025-05-07T20:32:54.9434306Z T: int, 2025-05-07T20:32:54.9434386Z D: int, 2025-05-07T20:32:54.9434480Z scale_ub: Optional[float], 2025-05-07T20:32:54.9434566Z contiguous: bool, 2025-05-07T20:32:54.9434652Z compiled: bool, 2025-05-07T20:32:54.9434727Z ) -> None: 2025-05-07T20:32:54.9434817Z torch.manual_seed(2025) 2025-05-07T20:32:54.9434890Z 2025-05-07T20:32:54.9435048Z > x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:54.9436810Z E torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 448.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 26.44 MiB is free. Including non-PyTorch memory, this process has 22.04 GiB memory in use. Of the allocated memory 21.73 GiB is allocated by PyTorch, and 19.12 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. 
See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables)
2025-05-07T20:32:54.9436819Z 
2025-05-07T20:32:54.9436933Z moe/activation_test.py:92: OutOfMemoryError
2025-05-07T20:32:54.9436937Z 
2025-05-07T20:32:54.9437040Z Trying example: test_silu_mul_quant(
2025-05-07T20:32:54.9437255Z     self=,
2025-05-07T20:32:54.9437328Z     T=16384,
2025-05-07T20:32:54.9437402Z     D=7168,
2025-05-07T20:32:54.9437480Z     scale_ub=1200.0,
2025-05-07T20:32:54.9437563Z     contiguous=True,
2025-05-07T20:32:54.9437647Z     compiled=False,
2025-05-07T20:32:54.9437716Z )
2025-05-07T20:32:54.9437927Z self = 
2025-05-07T20:32:54.9438097Z T = 16384, D = 7168, scale_ub = 1200.0, contiguous = True, compiled = False
2025-05-07T20:32:54.9438165Z 
2025-05-07T20:32:54.9438238Z @given(
2025-05-07T20:32:54.9438357Z     T=st.sampled_from([1, 128, 2048, 4096, 16384]),
2025-05-07T20:32:54.9438452Z     D=st.sampled_from([5120, 7168]),
2025-05-07T20:32:54.9438564Z     scale_ub=st.sampled_from([None, 1200.00]),
2025-05-07T20:32:54.9438679Z     contiguous=st.sampled_from([True, False]),
2025-05-07T20:32:54.9438790Z     compiled=st.sampled_from([True, False]),
2025-05-07T20:32:54.9438860Z )
2025-05-07T20:32:54.9439097Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None)
2025-05-07T20:32:54.9439185Z def test_silu_mul_quant(
2025-05-07T20:32:54.9439258Z     self,
2025-05-07T20:32:54.9439335Z     T: int,
2025-05-07T20:32:54.9439408Z     D: int,
2025-05-07T20:32:54.9439505Z     scale_ub: Optional[float],
2025-05-07T20:32:54.9439590Z     contiguous: bool,
2025-05-07T20:32:54.9439671Z     compiled: bool,
2025-05-07T20:32:54.9439750Z ) -> None:
2025-05-07T20:32:54.9439840Z     torch.manual_seed(2025)
2025-05-07T20:32:54.9439909Z 
2025-05-07T20:32:54.9440076Z >   x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16)
2025-05-07T20:32:54.9441798Z E   torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 448.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 26.44 MiB is free. Including non-PyTorch memory, this process has 22.04 GiB memory in use. Of the allocated memory 21.73 GiB is allocated by PyTorch, and 19.12 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables)
2025-05-07T20:32:54.9441849Z 
2025-05-07T20:32:54.9441966Z moe/activation_test.py:92: OutOfMemoryError
2025-05-07T20:32:54.9441970Z 
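[NOTE] The failed request sizes track the example parameters exactly: the test allocates x with shape [T, 2 * D] in bfloat16 (2 bytes per element), so T=16384, D=7168 is 16384 * 14336 * 2 bytes = 448 MiB, matching the "Tried to allocate 448.00 MiB" above, and T=2048, D=5120 gives the earlier 40 MiB. The allocations are individually modest; the device is simply already full. A quick check:

def randn_alloc_mib(T: int, D: int, bytes_per_elem: int = 2) -> float:
    # x = torch.randn([T, 2 * D], dtype=torch.bfloat16) -> 2 bytes/element.
    return T * 2 * D * bytes_per_elem / (1024 ** 2)

assert randn_alloc_mib(16384, 7168) == 448.0  # matches the failure above
assert randn_alloc_mib(2048, 5120) == 40.0    # matches the earlier failure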
2025-05-07T20:32:54.9442069Z Trying example: test_silu_mul_quant(
2025-05-07T20:32:54.9442286Z     self=,
2025-05-07T20:32:54.9442361Z     T=128,
2025-05-07T20:32:54.9442473Z     D=5120,
2025-05-07T20:32:54.9442555Z     scale_ub=1200.0,
2025-05-07T20:32:54.9442640Z     contiguous=False,
2025-05-07T20:32:54.9442722Z     compiled=False,
2025-05-07T20:32:54.9442794Z )
2025-05-07T20:32:54.9443003Z self = 
2025-05-07T20:32:54.9443170Z T = 128, D = 5120, scale_ub = 1200.0, contiguous = False, compiled = False
2025-05-07T20:32:54.9443174Z 
2025-05-07T20:32:54.9443249Z @given(
2025-05-07T20:32:54.9443361Z     T=st.sampled_from([1, 128, 2048, 4096, 16384]),
2025-05-07T20:32:54.9443456Z     D=st.sampled_from([5120, 7168]),
2025-05-07T20:32:54.9443571Z     scale_ub=st.sampled_from([None, 1200.00]),
2025-05-07T20:32:54.9443683Z     contiguous=st.sampled_from([True, False]),
2025-05-07T20:32:54.9443793Z     compiled=st.sampled_from([True, False]),
2025-05-07T20:32:54.9443867Z )
2025-05-07T20:32:54.9444101Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None)
2025-05-07T20:32:54.9444196Z def test_silu_mul_quant(
2025-05-07T20:32:54.9444311Z     self,
2025-05-07T20:32:54.9444386Z     T: int,
2025-05-07T20:32:54.9444462Z     D: int,
2025-05-07T20:32:54.9444557Z     scale_ub: Optional[float],
2025-05-07T20:32:54.9444644Z     contiguous: bool,
2025-05-07T20:32:54.9444729Z     compiled: bool,
2025-05-07T20:32:54.9444805Z ) -> None:
2025-05-07T20:32:54.9444895Z     torch.manual_seed(2025)
2025-05-07T20:32:54.9444967Z 
2025-05-07T20:32:54.9445126Z     x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16)
2025-05-07T20:32:54.9445199Z 
2025-05-07T20:32:54.9445289Z     x_sign = torch.sign(x)
2025-05-07T20:32:54.9445409Z     x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0)
2025-05-07T20:32:54.9445497Z     x = x_sign * x_clamp
2025-05-07T20:32:54.9445616Z     x0 = x[:, :D]
2025-05-07T20:32:54.9445694Z     x1 = x[:, D:]
2025-05-07T20:32:54.9445768Z 
2025-05-07T20:32:54.9445848Z     if contiguous:
2025-05-07T20:32:54.9445938Z         x0 = x0.contiguous()
2025-05-07T20:32:54.9446031Z         x1 = x1.contiguous()
2025-05-07T20:32:54.9446100Z 
2025-05-07T20:32:54.9446186Z     if scale_ub is not None:
2025-05-07T20:32:54.9446299Z         scale_ub_tensor = torch.tensor(
2025-05-07T20:32:54.9446432Z             [scale_ub], device="cuda", dtype=torch.float32
2025-05-07T20:32:54.9446504Z         )
2025-05-07T20:32:54.9446576Z     else:
2025-05-07T20:32:54.9446668Z         scale_ub_tensor = None
2025-05-07T20:32:54.9446737Z 
2025-05-07T20:32:54.9446860Z     def fn() -> Tuple[torch.Tensor, torch.Tensor]:
2025-05-07T20:32:54.9446948Z         op = silu_mul_quant
2025-05-07T20:32:54.9447028Z         if compiled:
2025-05-07T20:32:54.9447127Z             op = torch.compile(op)
2025-05-07T20:32:54.9447231Z         return op(x0, x1, scale_ub_tensor)
2025-05-07T20:32:54.9447300Z 
2025-05-07T20:32:54.9447389Z >   y_fp8, y_scale = fn()
2025-05-07T20:32:54.9447396Z 
2025-05-07T20:32:54.9447489Z moe/activation_test.py:117: 
2025-05-07T20:32:54.9447620Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 
2025-05-07T20:32:54.9447720Z moe/activation_test.py:115: in fn
2025-05-07T20:32:54.9447861Z     return op(x0, x1, scale_ub_tensor)
2025-05-07T20:32:54.9448354Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant
2025-05-07T20:32:54.9448451Z 
_fbgemm_silu_mul_quant[grid]( 2025-05-07T20:32:54.9448804Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:54.9449024Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:54.9449358Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:54.9449451Z kernel = self.compile( 2025-05-07T20:32:54.9449873Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:54.9450043Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:54.9450178Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:54.9450183Z 2025-05-07T20:32:54.9450380Z self = 2025-05-07T20:32:54.9451142Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:54.9451638Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7fe5b9e951c0>} 2025-05-07T20:32:54.9452410Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:54.9452600Z context = 2025-05-07T20:32:54.9452607Z 2025-05-07T20:32:54.9452770Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:54.9453024Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:54.9453132Z module_map=module_map) 2025-05-07T20:32:54.9453289Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:54.9453387Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:32:54.9453462Z E ^ 2025-05-07T20:32:54.9453900Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:54.9453946Z 2025-05-07T20:32:54.9454356Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:54.9454364Z 2025-05-07T20:32:54.9454463Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:54.9454682Z self=, 2025-05-07T20:32:54.9454758Z T=2048, 2025-05-07T20:32:54.9454830Z D=7168, 2025-05-07T20:32:54.9454912Z scale_ub=None, 2025-05-07T20:32:54.9454995Z contiguous=False, 2025-05-07T20:32:54.9455076Z compiled=False, 2025-05-07T20:32:54.9455149Z ) 2025-05-07T20:32:54.9455360Z self = 2025-05-07T20:32:54.9455526Z T = 2048, D = 7168, scale_ub = None, contiguous = False, compiled = False 2025-05-07T20:32:54.9455531Z 2025-05-07T20:32:54.9455607Z @given( 2025-05-07T20:32:54.9455725Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:54.9455826Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:54.9455945Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:54.9456061Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:54.9456174Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:54.9456245Z ) 2025-05-07T20:32:54.9456526Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:54.9456619Z def test_silu_mul_quant( 2025-05-07T20:32:54.9456692Z self, 2025-05-07T20:32:54.9456767Z T: int, 2025-05-07T20:32:54.9456846Z D: int, 2025-05-07T20:32:54.9456941Z scale_ub: Optional[float], 2025-05-07T20:32:54.9457029Z contiguous: bool, 2025-05-07T20:32:54.9457114Z compiled: bool, 2025-05-07T20:32:54.9457189Z ) -> None: 2025-05-07T20:32:54.9457282Z torch.manual_seed(2025) 2025-05-07T20:32:54.9457352Z 2025-05-07T20:32:54.9457517Z > x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:54.9459293Z E torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 56.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 26.44 MiB is free. Including non-PyTorch memory, this process has 22.04 GiB memory in use. Of the allocated memory 21.74 GiB is allocated by PyTorch, and 10.99 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. 
See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables) 2025-05-07T20:32:54.9459302Z 2025-05-07T20:32:54.9459418Z moe/activation_test.py:92: OutOfMemoryError 2025-05-07T20:32:54.9459422Z 2025-05-07T20:32:54.9459524Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:54.9459740Z self=, 2025-05-07T20:32:54.9459817Z T=128, 2025-05-07T20:32:54.9459893Z D=7168, 2025-05-07T20:32:54.9459972Z scale_ub=1200.0, 2025-05-07T20:32:54.9460054Z contiguous=True, 2025-05-07T20:32:54.9460137Z compiled=True, 2025-05-07T20:32:54.9460206Z ) 2025-05-07T20:32:54.9460458Z self = 2025-05-07T20:32:54.9460617Z T = 128, D = 7168, scale_ub = 1200.0, contiguous = True, compiled = True 2025-05-07T20:32:54.9460625Z 2025-05-07T20:32:54.9460698Z @given( 2025-05-07T20:32:54.9460817Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:54.9460911Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:54.9461021Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:54.9461135Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:54.9461244Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:54.9461314Z ) 2025-05-07T20:32:54.9461555Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:54.9461685Z def test_silu_mul_quant( 2025-05-07T20:32:54.9461762Z self, 2025-05-07T20:32:54.9461840Z T: int, 2025-05-07T20:32:54.9461913Z D: int, 2025-05-07T20:32:54.9462014Z scale_ub: Optional[float], 2025-05-07T20:32:54.9462099Z contiguous: bool, 2025-05-07T20:32:54.9462181Z compiled: bool, 2025-05-07T20:32:54.9462261Z ) -> None: 2025-05-07T20:32:54.9462350Z torch.manual_seed(2025) 2025-05-07T20:32:54.9462419Z 2025-05-07T20:32:54.9462584Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:54.9462654Z 2025-05-07T20:32:54.9462743Z x_sign = torch.sign(x) 2025-05-07T20:32:54.9462868Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:54.9462953Z x = x_sign * x_clamp 2025-05-07T20:32:54.9463034Z x0 = x[:, :D] 2025-05-07T20:32:54.9463112Z x1 = x[:, D:] 2025-05-07T20:32:54.9463180Z 2025-05-07T20:32:54.9463266Z if contiguous: 2025-05-07T20:32:54.9463353Z x0 = x0.contiguous() 2025-05-07T20:32:54.9463441Z x1 = x1.contiguous() 2025-05-07T20:32:54.9463516Z 2025-05-07T20:32:54.9463606Z if scale_ub is not None: 2025-05-07T20:32:54.9463707Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:54.9463841Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:54.9463958Z ) 2025-05-07T20:32:54.9464034Z else: 2025-05-07T20:32:54.9464128Z scale_ub_tensor = None 2025-05-07T20:32:54.9464197Z 2025-05-07T20:32:54.9464320Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:54.9464410Z op = silu_mul_quant 2025-05-07T20:32:54.9464492Z if compiled: 2025-05-07T20:32:54.9464590Z op = torch.compile(op) 2025-05-07T20:32:54.9464690Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:54.9464759Z 2025-05-07T20:32:54.9464847Z > y_fp8, y_scale = fn() 2025-05-07T20:32:54.9464855Z 2025-05-07T20:32:54.9464948Z moe/activation_test.py:117: 2025-05-07T20:32:54.9465114Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:54.9465219Z moe/activation_test.py:115: in fn 2025-05-07T20:32:54.9465314Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:54.9465675Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/torch/_dynamo/eval_frame.py:678: in _fn 2025-05-07T20:32:54.9465769Z return fn(*args, **kwargs) 
2025-05-07T20:32:54.9466252Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:32:54.9466349Z _fbgemm_silu_mul_quant[grid]( 2025-05-07T20:32:54.9466701Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:54.9466915Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:54.9467254Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:54.9467346Z kernel = self.compile( 2025-05-07T20:32:54.9467791Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:54.9467964Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:54.9468090Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:54.9468094Z 2025-05-07T20:32:54.9468296Z self = 2025-05-07T20:32:54.9469057Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:54.9469554Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7fe5b9de7b00>} 2025-05-07T20:32:54.9470333Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:54.9470523Z context = 2025-05-07T20:32:54.9470532Z 2025-05-07T20:32:54.9470692Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:54.9470945Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:54.9471052Z module_map=module_map) 2025-05-07T20:32:54.9471210Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:54.9471304Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:32:54.9471381Z E ^ 2025-05-07T20:32:54.9471729Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:54.9471734Z 2025-05-07T20:32:54.9472143Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:54.9472147Z 2025-05-07T20:32:54.9472247Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:54.9472504Z self=, 2025-05-07T20:32:54.9472582Z T=128, 2025-05-07T20:32:54.9472656Z D=7168, 2025-05-07T20:32:54.9472736Z scale_ub=1200.0, 2025-05-07T20:32:54.9472820Z contiguous=True, 2025-05-07T20:32:54.9472899Z compiled=False, 2025-05-07T20:32:54.9472970Z ) 2025-05-07T20:32:54.9473186Z self = 2025-05-07T20:32:54.9473346Z T = 128, D = 7168, scale_ub = 1200.0, contiguous = True, compiled = False 2025-05-07T20:32:54.9473351Z 2025-05-07T20:32:54.9473429Z @given( 2025-05-07T20:32:54.9473543Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:54.9473677Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:54.9473794Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:54.9473906Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:54.9474014Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:54.9474090Z ) 2025-05-07T20:32:54.9474327Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:54.9474416Z def test_silu_mul_quant( 2025-05-07T20:32:54.9474492Z self, 2025-05-07T20:32:54.9474566Z T: int, 2025-05-07T20:32:54.9474641Z D: int, 2025-05-07T20:32:54.9474735Z scale_ub: Optional[float], 2025-05-07T20:32:54.9474821Z contiguous: bool, 2025-05-07T20:32:54.9474905Z compiled: bool, 2025-05-07T20:32:54.9474980Z ) -> None: 2025-05-07T20:32:54.9475071Z torch.manual_seed(2025) 2025-05-07T20:32:54.9475147Z 2025-05-07T20:32:54.9475305Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:54.9475377Z 2025-05-07T20:32:54.9475510Z x_sign = torch.sign(x) 2025-05-07T20:32:54.9475632Z > x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:54.9477358Z E torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 20.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 4.44 MiB is free. Including non-PyTorch memory, this process has 22.06 GiB memory in use. Of the allocated memory 21.77 GiB is allocated by PyTorch, and 6.37 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. 
See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables) 2025-05-07T20:32:54.9477367Z 2025-05-07T20:32:54.9477481Z moe/activation_test.py:95: OutOfMemoryError 2025-05-07T20:32:54.9477523Z 2025-05-07T20:32:54.9477624Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:54.9477841Z self=, 2025-05-07T20:32:54.9477917Z T=128, 2025-05-07T20:32:54.9477995Z D=5120, 2025-05-07T20:32:54.9478075Z scale_ub=1200.0, 2025-05-07T20:32:54.9478155Z contiguous=True, 2025-05-07T20:32:54.9478241Z compiled=True, 2025-05-07T20:32:54.9478311Z ) 2025-05-07T20:32:54.9478520Z self = 2025-05-07T20:32:54.9478681Z T = 128, D = 5120, scale_ub = 1200.0, contiguous = True, compiled = True 2025-05-07T20:32:54.9478685Z 2025-05-07T20:32:54.9478757Z @given( 2025-05-07T20:32:54.9478875Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:54.9478970Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:54.9479083Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:54.9479201Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:54.9479309Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:54.9479380Z ) 2025-05-07T20:32:54.9479622Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:54.9479712Z def test_silu_mul_quant( 2025-05-07T20:32:54.9479786Z self, 2025-05-07T20:32:54.9479910Z T: int, 2025-05-07T20:32:54.9479984Z D: int, 2025-05-07T20:32:54.9480080Z scale_ub: Optional[float], 2025-05-07T20:32:54.9480167Z contiguous: bool, 2025-05-07T20:32:54.9480249Z compiled: bool, 2025-05-07T20:32:54.9480326Z ) -> None: 2025-05-07T20:32:54.9480418Z torch.manual_seed(2025) 2025-05-07T20:32:54.9480488Z 2025-05-07T20:32:54.9480650Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:54.9480719Z 2025-05-07T20:32:54.9480806Z x_sign = torch.sign(x) 2025-05-07T20:32:54.9480929Z > x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:54.9482687Z E torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 20.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 4.44 MiB is free. Including non-PyTorch memory, this process has 22.06 GiB memory in use. Of the allocated memory 21.77 GiB is allocated by PyTorch, and 3.87 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. 
See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables)
2025-05-07T20:32:54.9482697Z 
2025-05-07T20:32:54.9482813Z moe/activation_test.py:95: OutOfMemoryError
2025-05-07T20:32:54.9482817Z 
2025-05-07T20:32:54.9482914Z Trying example: test_silu_mul_quant(
2025-05-07T20:32:54.9483128Z     self=,
2025-05-07T20:32:54.9483204Z     T=128,
2025-05-07T20:32:54.9483277Z     D=7168,
2025-05-07T20:32:54.9483362Z     scale_ub=None,
2025-05-07T20:32:54.9483443Z     contiguous=True,
2025-05-07T20:32:54.9483522Z     compiled=True,
2025-05-07T20:32:54.9483596Z )
2025-05-07T20:32:54.9483842Z self = 
2025-05-07T20:32:54.9484002Z T = 128, D = 7168, scale_ub = None, contiguous = True, compiled = True
2025-05-07T20:32:54.9484009Z 
2025-05-07T20:32:54.9484087Z @given(
2025-05-07T20:32:54.9484200Z     T=st.sampled_from([1, 128, 2048, 4096, 16384]),
2025-05-07T20:32:54.9484294Z     D=st.sampled_from([5120, 7168]),
2025-05-07T20:32:54.9484407Z     scale_ub=st.sampled_from([None, 1200.00]),
2025-05-07T20:32:54.9484519Z     contiguous=st.sampled_from([True, False]),
2025-05-07T20:32:54.9484630Z     compiled=st.sampled_from([True, False]),
2025-05-07T20:32:54.9484700Z )
2025-05-07T20:32:54.9484937Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None)
2025-05-07T20:32:54.9485071Z def test_silu_mul_quant(
2025-05-07T20:32:54.9485145Z     self,
2025-05-07T20:32:54.9485218Z     T: int,
2025-05-07T20:32:54.9485294Z     D: int,
2025-05-07T20:32:54.9485392Z     scale_ub: Optional[float],
2025-05-07T20:32:54.9485479Z     contiguous: bool,
2025-05-07T20:32:54.9485565Z     compiled: bool,
2025-05-07T20:32:54.9485639Z ) -> None:
2025-05-07T20:32:54.9485734Z     torch.manual_seed(2025)
2025-05-07T20:32:54.9485806Z 
2025-05-07T20:32:54.9485965Z >   x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16)
2025-05-07T20:32:54.9487684Z E   torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 20.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 4.44 MiB is free. Including non-PyTorch memory, this process has 22.06 GiB memory in use. Of the allocated memory 21.77 GiB is allocated by PyTorch, and 3.87 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables)
2025-05-07T20:32:54.9487695Z 
2025-05-07T20:32:54.9487810Z moe/activation_test.py:92: OutOfMemoryError
2025-05-07T20:32:54.9487944Z =============================== warnings summary ===============================
2025-05-07T20:32:54.9488290Z ../../../../../../../../miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/autotuner.py:108
2025-05-07T20:32:54.9488585Z ../../../../../../../../miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/autotuner.py:108
2025-05-07T20:32:54.9488882Z ../../../../../../../../miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/autotuner.py:108
2025-05-07T20:32:54.9489738Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/autotuner.py:108: DeprecationWarning: warmup, rep, and use_cuda_graph parameters are deprecated. See https://github.com/triton-lang/triton/pull/4496 for details.
2025-05-07T20:32:54.9489967Z warnings.warn(("warmup, rep, and use_cuda_graph parameters are deprecated. See "
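[NOTE] The three collected warnings are unrelated to the failures: Triton deprecated the warmup, rep, and use_cuda_graph autotuner arguments (triton-lang/triton#4496), and something on the import path still passes them. A sketch of an autotuned kernel declared without the deprecated arguments; the kernel, block sizes, and key are illustrative only:

import triton
import triton.language as tl

@triton.autotune(
    configs=[
        triton.Config({"BLOCK": 128}, num_warps=4),
        triton.Config({"BLOCK": 256}, num_warps=8),
    ],
    key=["n"],  # re-tune when n changes; no warmup/rep/use_cuda_graph here
)
@triton.jit
def _double_kernel(x_ptr, y_ptr, n, BLOCK: tl.constexpr):
    # Elementwise y = 2 * x over n elements, one block per program.
    pid = tl.program_id(0)
    offs = pid * BLOCK + tl.arange(0, BLOCK)
    mask = offs < n
    tl.store(y_ptr + offs, tl.load(x_ptr + offs, mask=mask) * 2.0, mask=mask)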
See " 2025-05-07T20:32:54.9490011Z 2025-05-07T20:32:54.9490217Z -- Docs: https://docs.pytest.org/en/stable/how-to/capture-warnings.html 2025-05-07T20:32:54.9490379Z ================= 1 failed, 1 deselected, 3 warnings in 11.97s ================= 2025-05-07T20:32:56.5825320Z ERROR conda.cli.main_run:execute(125): `conda run python -m pytest -v -rsx -s -W ignore::pytest.PytestCollectionWarning --lf --last-failed-no-failures none ./moe/activation_test.py` failed. (See above for error) 2025-05-07T20:32:56.6460795Z [EXEC] [ATTEMPT 2/2] Command attempt failed. 2025-05-07T20:32:56.6461036Z 2025-05-07T20:32:56.6461215Z [EXEC] The command has failed after 2 + 1 attempts; aborting. 2025-05-07T20:32:56.6461779Z [TEST] Python test suite FAILED for some or all tests despite multiple retries: ./moe/activation_test.py 2025-05-07T20:32:56.6462168Z 2025-05-07T20:32:56.6462196Z 2025-05-07T20:32:56.6462200Z 2025-05-07T20:32:56.6478554Z ##[error]Process completed with exit code 1. 2025-05-07T20:32:56.6564551Z Post job cleanup. 2025-05-07T20:32:56.7552465Z [command]/usr/bin/git version 2025-05-07T20:32:56.7595562Z git version 2.47.1 2025-05-07T20:32:56.7635261Z Copying '/home/ec2-user/.gitconfig' to '/home/ec2-user/actions-runner/_work/_temp/e14a5568-deff-401d-b484-86b49e6546de/.gitconfig' 2025-05-07T20:32:56.7646407Z Temporarily overriding HOME='/home/ec2-user/actions-runner/_work/_temp/e14a5568-deff-401d-b484-86b49e6546de' before making global git config changes 2025-05-07T20:32:56.7647247Z Adding repository directory to the temporary git global config as a safe directory 2025-05-07T20:32:56.7660952Z [command]/usr/bin/git config --global --add safe.directory /home/ec2-user/actions-runner/_work/FBGEMM/FBGEMM 2025-05-07T20:32:56.7705556Z [command]/usr/bin/git config --local --name-only --get-regexp core\.sshCommand 2025-05-07T20:32:56.7741060Z [command]/usr/bin/git submodule foreach --recursive sh -c "git config --local --name-only --get-regexp 'core\.sshCommand' && git config --local --unset-all 'core.sshCommand' || :" 2025-05-07T20:32:56.8082602Z Entering 'external/asmjit' 2025-05-07T20:32:56.8149927Z Entering 'external/composable_kernel' 2025-05-07T20:32:56.8223351Z Entering 'external/cpuinfo' 2025-05-07T20:32:56.8292555Z Entering 'external/cutlass' 2025-05-07T20:32:56.8368889Z Entering 'external/googletest' 2025-05-07T20:32:56.8436192Z Entering 'external/hipify_torch' 2025-05-07T20:32:56.8503537Z Entering 'external/json' 2025-05-07T20:32:56.8594277Z [command]/usr/bin/git config --local --name-only --get-regexp http\.https\:\/\/github\.com\/\.extraheader 2025-05-07T20:32:56.8620639Z http.https://github.com/.extraheader 2025-05-07T20:32:56.8632328Z [command]/usr/bin/git config --local --unset-all http.https://github.com/.extraheader 2025-05-07T20:32:56.8665772Z [command]/usr/bin/git submodule foreach --recursive sh -c "git config --local --name-only --get-regexp 'http\.https\:\/\/github\.com\/\.extraheader' && git config --local --unset-all 'http.https://github.com/.extraheader' || :" 2025-05-07T20:32:56.8995337Z Entering 'external/asmjit' 2025-05-07T20:32:56.9041570Z http.https://github.com/.extraheader 2025-05-07T20:32:56.9084733Z Entering 'external/composable_kernel' 2025-05-07T20:32:56.9131158Z http.https://github.com/.extraheader 2025-05-07T20:32:56.9180691Z Entering 'external/cpuinfo' 2025-05-07T20:32:56.9223543Z http.https://github.com/.extraheader 2025-05-07T20:32:56.9268039Z Entering 'external/cutlass' 2025-05-07T20:32:56.9313840Z http.https://github.com/.extraheader 2025-05-07T20:32:56.9364906Z 
2025-05-07T20:32:56.6564551Z Post job cleanup.
2025-05-07T20:32:56.7552465Z [command]/usr/bin/git version
2025-05-07T20:32:56.7595562Z git version 2.47.1
2025-05-07T20:32:56.7635261Z Copying '/home/ec2-user/.gitconfig' to '/home/ec2-user/actions-runner/_work/_temp/e14a5568-deff-401d-b484-86b49e6546de/.gitconfig'
2025-05-07T20:32:56.7646407Z Temporarily overriding HOME='/home/ec2-user/actions-runner/_work/_temp/e14a5568-deff-401d-b484-86b49e6546de' before making global git config changes
2025-05-07T20:32:56.7647247Z Adding repository directory to the temporary git global config as a safe directory
2025-05-07T20:32:56.7660952Z [command]/usr/bin/git config --global --add safe.directory /home/ec2-user/actions-runner/_work/FBGEMM/FBGEMM
2025-05-07T20:32:56.7705556Z [command]/usr/bin/git config --local --name-only --get-regexp core\.sshCommand
2025-05-07T20:32:56.7741060Z [command]/usr/bin/git submodule foreach --recursive sh -c "git config --local --name-only --get-regexp 'core\.sshCommand' && git config --local --unset-all 'core.sshCommand' || :"
2025-05-07T20:32:56.8082602Z Entering 'external/asmjit'
2025-05-07T20:32:56.8149927Z Entering 'external/composable_kernel'
2025-05-07T20:32:56.8223351Z Entering 'external/cpuinfo'
2025-05-07T20:32:56.8292555Z Entering 'external/cutlass'
2025-05-07T20:32:56.8368889Z Entering 'external/googletest'
2025-05-07T20:32:56.8436192Z Entering 'external/hipify_torch'
2025-05-07T20:32:56.8503537Z Entering 'external/json'
2025-05-07T20:32:56.8594277Z [command]/usr/bin/git config --local --name-only --get-regexp http\.https\:\/\/github\.com\/\.extraheader
2025-05-07T20:32:56.8620639Z http.https://github.com/.extraheader
2025-05-07T20:32:56.8632328Z [command]/usr/bin/git config --local --unset-all http.https://github.com/.extraheader
2025-05-07T20:32:56.8665772Z [command]/usr/bin/git submodule foreach --recursive sh -c "git config --local --name-only --get-regexp 'http\.https\:\/\/github\.com\/\.extraheader' && git config --local --unset-all 'http.https://github.com/.extraheader' || :"
2025-05-07T20:32:56.8995337Z Entering 'external/asmjit'
2025-05-07T20:32:56.9041570Z http.https://github.com/.extraheader
2025-05-07T20:32:56.9084733Z Entering 'external/composable_kernel'
2025-05-07T20:32:56.9131158Z http.https://github.com/.extraheader
2025-05-07T20:32:56.9180691Z Entering 'external/cpuinfo'
2025-05-07T20:32:56.9223543Z http.https://github.com/.extraheader
2025-05-07T20:32:56.9268039Z Entering 'external/cutlass'
2025-05-07T20:32:56.9313840Z http.https://github.com/.extraheader
2025-05-07T20:32:56.9364906Z Entering 'external/googletest'
2025-05-07T20:32:56.9407150Z http.https://github.com/.extraheader
2025-05-07T20:32:56.9449748Z Entering 'external/hipify_torch'
2025-05-07T20:32:56.9493071Z http.https://github.com/.extraheader
2025-05-07T20:32:56.9536136Z Entering 'external/json'
2025-05-07T20:32:56.9579294Z http.https://github.com/.extraheader
2025-05-07T20:32:56.9729851Z A job completed hook has been configured by the self-hosted runner administrator
2025-05-07T20:32:56.9757674Z ##[group]Run '/home/ec2-user/runner-scripts/after_job.sh'
2025-05-07T20:32:56.9768342Z shell: /usr/bin/bash --noprofile --norc -e -o pipefail {0}
2025-05-07T20:32:56.9768700Z ##[endgroup]
2025-05-07T20:32:56.9871866Z [!ALERT!] Swap in detected! [!ALERT!]
2025-05-07T20:33:07.7867180Z [!ALERT!] Swap out detected [!ALERT!]
2025-05-07T20:33:24.3653060Z Cleaning up orphan processes
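[NOTE] The after_job.sh swap alerts say the host itself came under memory pressure during the job, a separate signal from the GPU OOM above. A small Linux-only snippet for capturing the host-side picture when these alerts fire; it reads /proc/meminfo directly and needs no extra dependencies:

def meminfo(keys=("MemTotal", "MemAvailable", "SwapTotal", "SwapFree")):
    # Parse /proc/meminfo into a {key: "value kB"} dict.
    out = {}
    with open("/proc/meminfo") as f:
        for line in f:
            key, value = line.split(":", 1)
            if key.strip() in keys:
                out[key.strip()] = value.strip()
    return out

print(meminfo())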